Nonlinear Adaptive Signal Processing
for Inverse Control

Bernard  Widrow  Michel  Bilello* 

Stanford  University  Department  of  Electrical  Engineering,  Stanford,  CA  94305-4055 


Abstract 

A plant can track an input command signal if it is driven by a controller whose transfer
function approximates the inverse of the plant transfer function. A stable inverse can be obtained even
if the plant is nonminimum-phase. No direct feedback is used, except that the plant output
is monitored and utilized to adapt the parameters of the controller. A model-reference inverse
control system can learn to approximate desired reference-model dynamics.

Control of internal plant disturbance is accomplished with an optimal adaptive disturbance
canceller.  It  does  not  affect  plant  dynamics,  but  feeds  back  plant  disturbance  in  a  way  that 
minimizes  disturbance  power  at  the  plant  output. 

Similar  principles  can  be  utilized  to  control  nonlinear  systems.  Neural  networks  are  used  to 
build  a  model  of  the  plant  and  to  construct  its  inverse. 


1  Introduction 

This  paper  presents  techniques  for  solving  adaptive  control  problems  by  means  of  adaptive  filtering. 

Many  problems  in  adaptive  control  can  be  divided  into  two  parts:  (a)  control  of  plant  dynamics, 
and  (b)  control  of  plant  disturbance.  Very  often,  a  single  system  is  utilized  to  achieve  both  of 
these  control  objectives.  Our  approach  however  treats  each  problem  separately.  Control  of  plant 
dynamics  can  be  achieved  by  preceding  the  plant  with  an  adaptive  controller  whose  transfer  function 
is  the  inverse  of  that  of  the  plant.  Control  of  plant  disturbance  can  be  achieved  by  an  adaptive 
feedback  process  that  minimizes  plant  output  disturbance  without  altering  plant  dynamics. 

The  principle  of  control  of  plant  dynamics  can  be  extended  to  deal  with  nonlinear  plants.  In 
that  case,  tapped  delay  lines  and  neural  networks  are  used  in  place  of  linear  adaptive  filters. 

2  Adaptive  Inverse  Control  for  Linear  Plants 

2.1  Direct  Plant  Identification 

Adaptive plant modeling, or identification, is an important function. Fig. 1 illustrates how this can
be done with an adaptive FIR filter. The plant input signal is the input to the adaptive filter. The
plant output signal is the desired response, the target signal for the filter output. The adaptive
algorithm, LMS [1] or RLS [2], minimizes mean square error, causing the model $\hat{P}$ to be a best
least squares match to the plant $P$ for the given input signal and for the given set of parameters
(weights) allocated to $\hat{P}$.
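As a minimal sketch of the identification loop of Fig. 1, the following Python fragment adapts an FIR model with the LMS rule. The three-tap plant, the white-noise input, the step size `mu` and the function name `lms_identify` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def lms_identify(u, d, n_weights, mu):
    """Adapt an FIR model so that its output tracks d when driven by u."""
    w = np.zeros(n_weights)
    errors = np.zeros(len(u))
    for k in range(n_weights - 1, len(u)):
        x = u[k - n_weights + 1:k + 1][::-1]   # tap vector [u_k, u_{k-1}, ...]
        e = d[k] - w @ x                       # desired response minus model output
        w += 2 * mu * e * x                    # LMS weight update
        errors[k] = e
    return w, errors

rng = np.random.default_rng(0)
plant = np.array([0.5, 1.0, -0.3])             # hypothetical FIR plant (assumption)
u = rng.standard_normal(5000)                  # plant input, also the filter input
d = np.convolve(u, plant)[:len(u)]             # plant output = desired response
w, errors = lms_identify(u, d, n_weights=3, mu=0.01)
```

In this noise-free toy case the model has as many weights as the plant has taps, so the weights can match the plant exactly; with fewer weights or a coloured input, the result would only be the best least squares match for that input.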

*This work was sponsored by NSF under grant NSF IRI 91-12531, by ONR under contract no. N00014-92-J-1787,
and by EPRI under contract RP:8010-13.

Figure 1: Direct plant identification.

2.2  Inverse  Plant  Identification 

Another  important  function  is  inverse  plant  identification.  This  technique  is  illustrated  in  Fig.  2(a). 
The  plant  input  is  as  before.  The  plant  output  is  the  input  to  the  adaptive  filter.  The  desired 
response  for  the  adaptive  filter  is  the  plant  input  in  this  case.  Minimizing  mean  square  error  causes 
the adaptive filter $\hat{P}^{-1}$ to be a best least squares inverse to the plant $P$ for the given input spectrum
and for the given set of weights of the adaptive filter. The adaptive algorithm attempts to make
the  cascade  of  plant  and  adaptive  inverse  behave  like  a  unit  gain.  This  process  is  often  called 
deconvolution. 


(a)  (b) 

Figure  2:  Inverse  identification,  (a)  for  minimum-phase  plants,  (b)  for  nonminimum-phase  plants 

For the sake of argument, the plant is assumed to have poles and zeros. An inverse, if it also had
poles and zeros, would need to have zeros where the plant had poles and poles where the plant had
zeros. Making an inverse would be no problem except for the case of a nonminimum-phase plant.
It would seem that such an inverse would need to have unstable poles, and this would be true if the
inverse were causal. If the inverse could be noncausal as well as causal, however, then a two-sided
stable inverse would exist for all linear time-invariant plants, in accord with the theory of two-sided
Laplace transforms.

A causal FIR filter can approximate a delayed version of the two-sided plant inverse, and an
adaptive FIR filter can self-adjust to this function. The method is illustrated in Fig. 2(b). The
time span of the adaptive filter (the number of weights multiplied by the sampling period) can be
made adequately long so that the mean square error of the optimized inverse would be a small
fraction of the plant input power. To achieve this objective with a nonminimum-phase plant, the
delay $\Delta$ needs to be chosen appropriately. The choice is generally not critical, however.
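The role of the delay can be illustrated with a small least squares computation. The sketch below assumes a white plant input, so that the best FIR inverse is the one whose convolution with the plant impulse response is closest to a delayed unit impulse. The three-tap plant (zeros at 2 and 0.5, hence nonminimum-phase), the filter length of 32 and the function name `delayed_inverse` are assumptions for illustration only.

```python
import numpy as np

def delayed_inverse(plant, n_weights, delay):
    """Least squares FIR inverse c such that plant * c approximates a delayed impulse."""
    n_out = len(plant) + n_weights - 1
    # Column j holds the plant impulse response shifted by j samples,
    # so conv_matrix @ c equals np.convolve(plant, c).
    conv_matrix = np.zeros((n_out, n_weights))
    for j in range(n_weights):
        conv_matrix[j:j + len(plant), j] = plant
    target = np.zeros(n_out)
    target[delay] = 1.0                          # unit impulse at the chosen delay
    c, *_ = np.linalg.lstsq(conv_matrix, target, rcond=None)
    error = np.sum((conv_matrix @ c - target) ** 2)
    return c, error

plant = np.array([1.0, -2.5, 1.0])               # hypothetical plant, zeros at 2 and 0.5
_, error_no_delay = delayed_inverse(plant, n_weights=32, delay=0)
_, error_half_delay = delayed_inverse(plant, n_weights=32, delay=16)
```

With no delay the causal filter cannot represent the noncausal part of the two-sided inverse and a large residual remains; with a delay of about half the filter length the residual becomes very small.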

The inverse filter is used as a controller in the present scheme, so that $\Delta$ becomes the response
delay of the controlled plant. Making $\Delta$ small is generally desirable, but the quality of control
depends upon the accuracy of the inversion process, which sometimes requires $\Delta$ to be of the order
of half the length of the adaptive filter, or less.

A  model-reference  inversion  process  is  shown  in  Fig.  3.  A  reference  model  is  used  in  place  of 
the  delay  of  Fig.  2(b).  Minimizing  mean  square  error  with  the  system  of  Fig.  3  causes  the  cascade 
of the plant and its “model-reference inverse” to approximate closely the response of a model $M$.
Much  is  known  about  the  design  of  model  reference  systems  [3].  The  model  is  chosen  to  give  a 
desirable  response  to  the  overall  system.  Some  delay  may  need  to  be  incorporated  into  the  model 
in  order  to  achieve  low  error. 


Figure  3:  Model-reference  plant  inverse. 


2.3  Adaptive  Control  of  Plant  Dynamics 

Once the plant inverse has been obtained, it can be used as a controller to provide a driving function for the
plant. This simple idea is illustrated in Fig. 4(a) for minimum-phase plants. Fig. 4(b) shows
the  corresponding  scheme  for  nonminimum-phase  systems.  Many  simulation  examples  have  been 
performed,  with  consistently  good  results,  as  long  as  the  plant  is  stable  or  is  first  stabilized  by 
feedback.  Extensive  analysis  will  be  presented  in  the  forthcoming  book  by  Widrow  and  Walach  [4]. 


(a)  (b) 

Figure 4: Inverse control scheme, (a) for minimum-phase plants, (b) for nonminimum-phase plants.




2.4  Adaptive  Plant  Disturbance  Cancelling 

The  systems  of  Fig.  4  only  control  and  compensate  for  plant  dynamics.  The  disturbance  appears 
at  the  plant  output  unabated.  The  only  way  that  the  plant  output  disturbance  can  be  reduced  is 
to  obtain  this  disturbance  from  the  plant  output  and  process  it,  then  feed  it  back  into  the  plant 
input.  The  system  shown  in  Fig.  5  does  this. 


Figure  5:  Disturbance  cancelling  system. 

In Fig. 5, an exact copy of $\hat{P}$ is fed the same input signal as the plant $P$. The output of this
$\hat{P}$ copy is subtracted from the plant output. Assuming that $\hat{P}$ has a dynamic response essentially
identical to that of the plant $P$, the difference in the outputs is a close estimate of the plant
disturbance. This disturbance is filtered by $Q$ and then subtracted from the plant input. The filter
$Q$ is generated by an off-line process that delivers new values of $Q$ almost instantaneously with new
values of $\hat{P}$, which adapts continually to keep up with changes in the plant $P$.

The  filter  Q  is  essentially  the  best  inverse  (without  delay)  of  P.  The  synthetic  disturbance  used 
to  train  Q  should  have  a  spectral  character  like  that  of  the  plant  disturbance.  It  is  shown  in  the 
Widrow  and  Walach  book  [4]  that  the  disturbance  cancelling  system  of  Fig.  5  adapts  and  converges 
to  minimize  the  plant  disturbance  at  the  plant  output.  As  such,  it  is  an  optimal  linear  least  squares 
system.  There  is  no  way  to  further  reduce  the  plant  disturbance. 

The system of Fig. 5 appears to be a feedback system. However, if $\hat{P}$ is dynamically the same as
$P$, the transfer function around the loop is zero. The transfer function from the Plant Input point
to  the  Plant  Output  point  is  the  same  as  that  of  the  plant  alone.  Thus,  the  disturbance  canceller 
does  not  affect  the  plant  dynamics. 
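The zero loop gain can be checked with a short calculation. The symbols $v$ (signal arriving at the Plant Input point), $u$ (signal actually driving the plant), $n$ (plant disturbance referred to the output) and $\tilde{n}$ (disturbance estimate) are notation assumed here for illustration, and all blocks are treated as linear transfer functions:

$$
\begin{aligned}
u &= v - Q\,\tilde{n}, \qquad y = P\,u + n, \\
\tilde{n} &= y - \hat{P}\,u = (P - \hat{P})\,u + n .
\end{aligned}
$$

If $\hat{P} = P$, then $\tilde{n} = n$ regardless of $u$, so nothing circulates around the loop, and

$$
y = P\,v + (1 - P\,Q)\,n .
$$

The path from $v$ to $y$ is $P$ alone, while the disturbance reaches the output through $1 - PQ$, which is small when $Q$ is a good undelayed inverse of the plant over the band occupied by the disturbance.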

Almost  perfect  disturbance  cancellation  is  possible  with  a  minimum-phase  plant.  With  a 
nonminimum-phase  plant,  even  optimal  cancelling  will  not  cancel  all  the  disturbance. 

2.5  Example 

A  simulation  experiment  has  been  done  to  illustrate  the  effectiveness  of  the  inversion  process.  Fig.  6 
shows  the  impulse  response  of  a  nonminimum-phase  plant  having  a  small  transport  delay.  Fig.  7(a) 
shows the impulse response of the best least squares inverse with a delay of $\Delta = 50$ sample periods.
Fig.  7(b)  is  a  convolution  of  the  plant  and  its  inverse  impulse  response.  The  result  is  essentially  a 
unit  impulse  at  a  delay  of  50. 

Fig. 8 shows results of a plant disturbance cancellation experiment. Although the plant in this
case  was  nonminimum-phase,  the  results  are  quite  good. 
Figure 6: Impulse response of nonminimum-phase plant.


Figure 8: (a) Output of undisturbed plant and output of disturbed plant, with and without disturbance
canceller. (b) Instantaneous power of plant output disturbance. (Canceller turned on at
k = 300.)


3  Nonlinear  Inverse  Control 

The  principles  of  inverse  control  can  be  extended  to  deal  with  nonlinear  systems.  Nonlinear  systems 
behave  quite  differently  from  their  linear  counterparts.  A  major  difference  is  that  whereas  a  linear 
system  possesses  a  unique  inverse,  nonlinear  systems  have  only  local  inverses  if  at  all,  valid  only  in 
a bounded region of the signal space. Whereas linear adaptive filters are used to control linear plants, the
inverse controller for nonlinear plants involves a type of recurrent neural network. The ability of
multilayered neural networks to approximate nonlinear mappings over compact regions, as detailed
in [5], makes them useful in identifying direct and inverse models.

The inverse control of nonlinear plants involves a two-stage process: first a model of the plant
is constructed (identification), and then the plant model is inverted.

3.1  Plant  Identification 

The system is modelled through the use of a feedforward multilayered neural network fitted with
tapped delay lines at its input and output and a feedback loop. This is the nonlinear equivalent
of a linear IIR filter. With an appropriate number of hidden neurons, such a neural network can
represent a system of the form

$$y_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-n}, u_{k-1}, \ldots, u_{k-p})$$

over a bounded region of input space. The choice of the integers $n$ and $p$ is part of the modelling
design and follows from requirements of model accuracy. The identification scheme, illustrated
in Fig. 9, is founded on a standard technique, which is the nonlinear equivalent of the equation-error
formulation described in [6], and is called a series-parallel model in [7]. The choice of this
formulation allows the use of the standard backpropagation algorithm suited to training feedforward
neural networks.
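The reason standard backpropagation suffices is that, in the series-parallel arrangement, the delay line on the output side is fed with measured plant outputs rather than with the model's own past outputs, so training reduces to a static regression. A minimal sketch of how such training pairs could be assembled is given below; the function name `series_parallel_regressors` and the toy sequences are assumptions for illustration.

```python
import numpy as np

def series_parallel_regressors(y, u, n, p):
    """Build rows [y_{k-1}, ..., y_{k-n}, u_{k-1}, ..., u_{k-p}] with targets y_k.

    y holds measured plant outputs and u the plant inputs, both indexed by k.
    """
    start = max(n, p)                            # first k with a full history
    rows = [np.concatenate((y[k - n:k][::-1], u[k - p:k][::-1]))
            for k in range(start, len(y))]
    return np.array(rows), y[start:]

# Toy sequences, used only to show the layout of the regressors.
y_demo = np.arange(6.0)
u_demo = 10.0 * np.arange(6.0)
X, targets = series_parallel_regressors(y_demo, u_demo, n=2, p=1)
```

Each row of `X` is one input pattern for the feedforward network and the matching entry of `targets` is its desired output.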


3.2  Computing  the  Inverse 

The second step is the design of the controller. Once the plant identification has been performed
and a model of the plant obtained, the controller, also implemented as a recurrent neural network,
is trained to behave like the inverse of the system. The algorithm used for training the controller is
a variant of the recurrent backpropagation algorithm [8]. The controller is trained upstream from
the plant model and the error signal is backpropagated through the plant model as illustrated in
Fig. 10.


Newton’s method. Although the previous algorithmic scheme performs well, it is advantageous
to consider an alternate procedure based on Newton’s method for solving nonlinear equations.

The  motivation  for  this  procedure  is  the  desire  to  increase  speed  and  precision  by  training 
the  inverse  of  the  plant  model  using  standard  backpropagation  as  opposed  to  a  recurrent  version 
thereof.  To  do  so,  a  desired  controller  output  needs  to  be  derived. 

The desired controller output is simply the input to the plant that would yield the desired plant
output. Thus, we want to solve the equation $y^{(d)} = T u$, where $y^{(d)}$ is the target (desired) signal
and $T$ is the nonlinear discrete-time plant mapping.

Let $A$ be the first derivative of $T$ evaluated at the origin. We iteratively apply the following
algorithm:


$$u^{(n+1)} = u^{(n)} + A^{-1}\left(y^{(d)} - T u^{(n)}\right)$$

In the above equations, the sequences are finite for obvious practical reasons. However, it
should be pointed out that under certain conditions, the convergence of this algorithm is guaranteed
even in the infinite-dimensional case. Of course, no matrix inversion needs to be performed here. In
fact, since $A$ is lower-triangular by virtue of causality, only a triangular system of linear equations
is solved at each time step by forward substitution.

The procedure is carried out according to Fig. 11. Note that the $u^{(n)}$ are entire sequences and
not time samples. Data is gathered from the actual plant and then processed off-line to supply the
next input sequence. If the inverse of the plant model is to be computed, the whole computation
can be performed off-line with the plant model in place of the physical plant. At the end of the
iterative procedure, the converged input sequence is obtained; it is the desired output for training the inverse controller.
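A minimal sketch of such a Newton-like iteration on whole sequences is given below. It rests on several assumptions that are not confirmed by the text: the update is taken to be $u \leftarrow u + A^{-1}(y^{(d)} - T u)$ with $A$ fixed at the origin, $A$ is estimated by central differences, and the plant map `plant` is a hypothetical stand-in of the form $y_k = y_{k-1}/(1 + y_{k-1}^2) + \sin(u_{k-1})$ with zero initial state.

```python
import numpy as np

def plant(u):
    """Hypothetical plant map T: inputs u_0..u_{N-1} -> outputs y_1..y_N, with y_0 = 0."""
    y = np.zeros(len(u))
    prev = 0.0
    for k in range(len(u)):
        prev = prev / (1.0 + prev ** 2) + np.sin(u[k])
        y[k] = prev
    return y

def jacobian_at_origin(T, n, eps=1e-5):
    """Central-difference estimate of A = dT/du at u = 0 (lower triangular by causality)."""
    A = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        A[:, j] = (T(e) - T(-e)) / (2 * eps)
    return A

def forward_substitution(A, b):
    """Solve A x = b for lower-triangular A without forming an inverse."""
    x = np.zeros(len(b))
    for i in range(len(b)):
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

def newton_like_inverse(T, y_desired, n_iter=300):
    """Iterate u <- u + A^{-1} (y_desired - T(u)) on entire sequences, starting from zero."""
    u = np.zeros(len(y_desired))
    A = jacobian_at_origin(T, len(y_desired))
    for _ in range(n_iter):
        residual = y_desired - T(u)
        u = u + forward_substitution(A, residual)
    return u

k = np.arange(30)
u_true = 0.3 * np.sin(0.4 * k)                   # input used only to build a reachable target
y_desired = plant(u_true)
u_found = newton_like_inverse(plant, y_desired)
max_residual = np.max(np.abs(plant(u_found) - y_desired))
```

The sequence `u_found` plays the role of the desired controller output: paired with `y_desired`, it would give input-target patterns for training the inverse controller with standard backpropagation.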




Figure  11:  Scheme  to  apply  Newton’s  method  for  computing  plant  input. 

3.3  Disturbance  Cancelling 

As for linear systems, a feedback scheme, illustrated in Fig. 12 and called Internal Model Control,
can be implemented which tends to eliminate plant output disturbance. Internal Model Control was
thoroughly studied in the context of linear systems [9], and an extension to nonlinear systems has been
suggested [10]. The underlying idea is the same as in the linear case. An estimate of the output disturbance
is produced by comparing the plant output and the plant model output. The estimated disturbance
is then fed back to the controller input. It turns out that if the closed-loop system is stable, and if
the controller is chosen to be the inverse of the plant model, then the disturbance can be cancelled.




Figure  12:  Internal  Model  Control  for  cancelling  output  disturbance. 


3.4  Examples 

Example  1  Let’s  consider  the  nonlinear  plant  suggested  in  [7]  and  defined  by  the  equation: 


$$y_k = \frac{y_{k-1}}{1 + y_{k-1}^2} + u_{k-1}^3$$
The input signal is confined to the interval [-2, 2]. The plant model in training configuration
(Fig.  9)  had  two  inputs,  the  external  input  u  and  the  output  from  the  real  plant,  and  one  output.  It 
had  one  hidden  layer  with  10  units.  The  result  of  the  plant  identification  is  displayed  in  Fig.  13(a). 
Specifically,  the  outputs  of  the  plant  and  plant  model  are  compared  when  the  same  test  signal  is 
fed  to  their  inputs. 
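For readers who wish to reproduce the setting, the sketch below simulates a plant of the form $y_k = y_{k-1}/(1 + y_{k-1}^2) + u_{k-1}^3$ (the form assumed here for Example 1) and assembles training patterns with the two model inputs named above. The uniform random input, the zero initial condition and the names `simulate_plant`, `model_inputs` and `model_targets` are assumptions for illustration.

```python
import numpy as np

def simulate_plant(u, y0=0.0):
    """Return y_0..y_N for y_k = y_{k-1} / (1 + y_{k-1}^2) + u_{k-1}^3."""
    y = np.empty(len(u) + 1)
    y[0] = y0
    for k in range(1, len(y)):
        y[k] = y[k - 1] / (1.0 + y[k - 1] ** 2) + u[k - 1] ** 3
    return y

rng = np.random.default_rng(0)
u = rng.uniform(-2.0, 2.0, size=200)             # input confined to [-2, 2]
y = simulate_plant(u)

# Series-parallel training patterns: (u_{k-1}, measured y_{k-1}) -> y_k
model_inputs = np.column_stack((u, y[:-1]))
model_targets = y[1:]
```

Because the first term never exceeds 0.5 in magnitude and the cubed input never exceeds 8, the simulated output stays within $\pm 8.5$, which bounds the region over which the model has to be accurate.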


(a)  (b) 

Figure  13:  Example  1.  (a)  Result  of  plant  identification  (b)  Performance  of  inverse  controller. 

The  neural  network  controller  had  a  two-tap  tapped  delay  line  as  input,  a  hidden  layer  with 
10  units  and  one  output  which  is  fed  to  the  plant  model.  The  error  is  backpropagated  through 
the  plant  model  using  on-line  recurrent  backpropagation.  The  time  plots  of  Fig.  13(b)  show  the 
command  input  fed  to  the  trained  inverse  controller  and  the  plant  output.  Although  there  are 
errors,  the  agreement  between  the  two  signals  is  very  good.  The  important  thing  to  note  is  that 
the  controller  is  trained  to  be  an  inverse  to  the  plant  model  and  not  the  plant  itself.  Consequently, 
good  performance  of  the  controller  is  contingent  on  building  an  accurate  model  for  the  plant. 

Next, we demonstrate the efficacy of Internal Model Control in cancelling the output disturbance.
The inverse controller and the plant model have been inserted in the control structure
according to Fig. 12. The result of this experiment, in the form of the power of the disturbance at
the plant output before and after the feedback loop is closed, is shown in Fig. 14.


Figure 14: Instantaneous power of disturbance at plant output (feedback loop closed at k = 50).
Example 2  As a second example, we modify the previous system equation and now consider:

$$y_k = \frac{y_{k-1}}{1 + y_{k-1}^2} + \sin(u_{k-1})$$

We  will  use  this  example  to  illustrate  the  use  of  Newton’s  method  to  train  the  inverse  controller. 
We start with the process of plant identification. The plant model neural network has similar
characteristics to that of the previous example. Using a random signal uniformly distributed in the
interval [-1, 1], we obtain the results displayed in Fig. 15.


Figure  15:  Result  of  nonlinear  plant  identification  for  example  2. 

Next, the inverse model is trained using the two different methods discussed earlier. The inverse
controller is then placed in cascade with the real plant to evaluate tracking performance. The results
in  Fig.  16(a)  were  produced  by  an  inverse  controller  trained  with  standard  backpropagation  after 
the  desired  controller  output  had  been  solved  for  using  the  Newton-like  method.  By  comparison, 
an  inverse  controller  trained  with  recurrent  backpropagation  through  the  plant  model  yielded  the 
results  of  Fig.  16(b).  The  plots  demonstrate  better  performance  with  the  former  method. 


(a)  (b)

Figure  16:  Tracking  performance  for  plant  of  example  2.  (a)  using  Newton  method  and  standard 
backpropagation.  (b)  using  recurrent  backpropagation. 
4  Conclusion


Methods for adaptive control of plant dynamics and for control of plant disturbance for unknown
linear plants have been described. In addition, an extension of control of plant dynamics to nonlinear
plants using neural networks has been presented. For proper application of these methods, the plant must be
stable. An unstable plant could first be stabilized with feedback, then adaptively controlled.
Abstract

Some neural network problems may seem to require a large frame size compared to frame number, which
would ordinarily lead to large statistical problems. Given the right assumptions, a sliding architecture may be used
to overcome these difficulties. Sometimes these assumptions are partially broken, thus producing non-optimum
results. This paper describes one such case, where some of the input signals are of unknown calibration. A solution
is suggested and then put to the test using an artificial problem. This produced up to a factor of 10 improvement in
the errors when compared to the case of a simple sliding architecture.


1  INTRODUCTION


Take a typical multi-layer perceptron problem: an input $x(k)$ is provided, where $k$ indicates which frame of
data the input is taken from; and a required output $y^{\mathrm{req}}(k)$ is provided, to which the actual output of the net $y(k)$ must
converge through the iterative setting of the weights of the network. Let the maximum number of available frames
$m$ be much smaller than the number of elements of the input vector $n_i$. This would normally lead to severe
statistical errors (including over-learning in an MLP [Chauvin 1990]) without a redesign of the network architecture
or of the data set. For a given problem, it may be possible to make the following two approximations.

The local approximation: the output $y_i$ can only be associated with the inputs $(x_{i - \Delta i/2}, \ldots, x_{i + \Delta i/2})$.

The location independence approximation: the mapping formed between the two is, by the nature
of the problem, independent of $i$.

If this is the case, then a sliding architecture can be used (see figure 1). This improves the ratio of frame
size to frame number by a factor of about $n_i^2/\Delta i$. This turns a large number of inputs into an advantage by
producing many more frames from them. Such an architecture is equivalent to the windowing of visual or other
data for classification problems such as character recognition. In Hand et al. [1992] this type of breakdown is used
to retrieve the positions of the eyes in the picture of a face. For many problems this is the simple and obvious
approach. Although its use here is intended for function estimation problems rather than classification, this is not a
novel architecture.
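The way a sliding architecture multiplies the number of frames can be sketched as follows. Each original frame of $n_i$ inputs is cut into overlapping windows of a fixed width, and each window is paired with the required output at its centre. The function name `sliding_frames`, the toy data and the convention that windows must fit entirely inside the input are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_frames(x, y, width):
    """Turn m frames of n_i inputs into m * (n_i - width + 1) small frames.

    x: (m, n_i) inputs, y: (m, n_i) required outputs.
    Returns the windows and the required output at each window centre
    (the upper-middle position when width is even).
    """
    windows = sliding_window_view(x, width, axis=1)   # shape (m, n_i - width + 1, width)
    half = width // 2
    centres = y[:, half:half + windows.shape[1]]
    return windows.reshape(-1, width), centres.reshape(-1)

x_demo = np.arange(30.0).reshape(3, 10)               # 3 frames of 10 inputs (toy data)
frames, targets = sliding_frames(x_demo, x_demo, width=4)
```

Here 3 frames of size 10 become 21 frames of size 4, which is the change in the frame size to frame number ratio that the text describes.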
Figure 1: Sliding network architecture


Because of the large improvement, it can be tempting to use this architecture even if the two
approximations are broken to some extent. However, it may often be possible to adapt the architecture so that this
frame size to number ratio improvement is still obtained, while still complying with the approximations.




This case arose from an application of neural networks to the measurement of the sound pressure response
of small rooms [Allison 1972, 1976; Bodlund 1976; Craik 1990]. Four sets of input data and one set of required
output were available, with 1024 frequency points per signal (and so $i$ represented the frequency). Because the
number of rooms with which to provide frames of data would always be small (in this case 48, half of which would
have to be reserved for testing), there is an inherent problem with the frame size to number ratio. However, it was a
reasonable approximation to make that an output at frequency $i$ could only be associated with inputs of frequency
close to $i$. This width of input was defined by the resonances present in a room [Kuttruff 1973, chapter III]. Each
mode of resonance had a certain frequency bandwidth over which it could be excited, and this bandwidth provides
the possibility of associating the output with inputs of slightly different frequency. This works out to around 10
frequency points out of 1024. Given that there were four sets of inputs, each output had to be associated with 40 inputs.
Although there were problems in doing so, it was also useful to use the location independence approximation,
because this led to a network that would extrapolate for rooms of different sizes.

Thus these two approximations led to the use of a sliding network architecture, where the network slid
over the frequency range, steadily filling in the outputs. In doing so, the frame size to number ratio was improved by
the order of 100,000, enabling the use of neural networks as a solution.
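The order of magnitude can be checked with a short calculation, assuming the improvement factor is $n_i^2/\Delta i$ with $n_i = 1024$ frequency points and a window of $\Delta i = 10$ points. The frame size shrinks by $n_i/\Delta i$, while the number of frames grows by roughly $n_i$, one per window position:

$$
\frac{n_i}{\Delta i} \times n_i = \frac{n_i^2}{\Delta i} = \frac{1024^2}{10} \approx 1.05 \times 10^{5} .
$$

Having four input sets does not change this ratio, since both the original frame size ($4 \times 1024$) and the window size ($4 \times 10$) scale by the same factor.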

However, in this example the calibration of three of the inputs was uncertain. This breaks the location
independence approximation. For a fully connected network, the uncalibrated nature of the input signal does not
present a problem because the network can calibrate itself to the input signal. This is not possible with a sliding
network because the calibration is dependent on $i$, whereas the network is independent of $i$.


2  SOLUTION 

In this case, the solution is quite simple and effective. Assuming that the required calibration for input $x_i$ is
essentially linear and independent of all other inputs, the calibration of the input can be carried out by what
will be termed a calibration layer. A calibration layer is a layer of single links that processes the uncalibrated input
nodes to produce a new set of nodes representing a calibrated input: a weighted copy of $x_i$ plus an offset $d_i$.
The sliding network can then slide along this new set of nodes, oblivious of the calibration. Thus these links are
fixed to each input instead of sliding along with the net, and can therefore provide a processing dependent on $i$.
In principle, the required output can also represent a signal of unknown calibration, and so an output calibration
layer can also be used. This new architecture is shown in figure 2.


Figure 2: Sliding network with calibration layers


Assuming that the sliding net would ordinarily be taught using some method of gradient descent of back-propagated
errors [Rumelhart, McClelland 1986; Werbos 1988], this learning scheme can easily be extended to
train the calibration layers. Thus, while the network is positioned at output $y_i$ and frame $k$, the output calibration link
to $y_i$ can be taught as part of the net, and also those input calibration links connected to the inputs currently in the window.
There are three points to note.
Although each output calibration link will only have $k$ frames from which to learn, and each input link $k\Delta i$
frames, this will not result in the learning problems encountered before the use of a sliding net architecture, because
each calibrated node only has access to a single input.

Although the architecture shown in figure 2 is a network containing four layers of weights, the problems
normally associated with deep nets when using backpropagation will not be present. This is because the errors
back-propagated through the output calibration links are not dispersed by the layer, and so remain specific to a
given output.

The weights of a feedforward network are usually initialised at random. This is because it is not possible to
say what the nodes that these weights connect to represent. However, in the case of the calibration layers, the roles
of the nodes involved are well defined. Thus the weights and thresholds can be initialised according to the means and
standard deviations (with respect to the training set of frames) of the inputs and required output.
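One plausible reading of this initialisation is sketched below: each calibration link starts as the linear map that standardises its own node over the training frames. The exact rule is not stated in the text, so the standardising form, the function names `init_calibration` and `calibrate`, and the synthetic gains and offsets are all assumptions for illustration.

```python
import numpy as np

def init_calibration(x_train):
    """Per-node weight and threshold from training-set statistics.

    x_train: (frames, nodes). Assumed rule: calibrated = (x - mean) / std.
    """
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    weight = 1.0 / std
    threshold = -mean / std
    return weight, threshold

def calibrate(x, weight, threshold):
    """Apply the calibration layer: one link per node, no coupling between nodes."""
    return weight * x + threshold

rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 1.0, size=(46, 20))       # 46 frames of 20 inputs
gains = rng.uniform(0.5, 2.0, size=20)           # hypothetical unknown calibration
offsets = rng.uniform(-1.0, 1.0, size=20)
x_train = gains * raw + offsets
weight, threshold = init_calibration(x_train)
x_cal = calibrate(x_train, weight, threshold)
```

With only 46 frames the estimated means and standard deviations are themselves noisy, which is why the calibration links would still need to be refined by learning after this initialisation.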

3  EXPERIMENT 

To  test  the  performance  of  such  a  system,  an  artificial  problem  was  devised. 

The input was 20 floating point numbers set to rectangular noise between 0 and 1. The required output was
set according to:

=  a) 

Three other vectors of 20 elements were also provided as input. These were also set with rectangular noise.
The required output was unrelated to these other inputs. This increased the size of the input so that it was
comparable to the number of frames used, and so forced the use of a sliding architecture.

These vectors formed one frame of data. There were 46 such frames constructed for the purposes of
training.

From  the  input  layer  upwards,  the  network  was  constructed  as  follows. 

Each input node was connected by a permanent calibration link to a calibrated input node. There was no
output function on the activation of these nodes.

The first layer of the sliding net connected 4 inputs from each input vector (16 in total) to the 7 nodes of the
first layer, which performed a weighted summation on these connections followed by a logistic sigmoid output
function to provide the network with a non-linear response.

These nodes led on to the second layer, which consisted of just one node performing another weighted
summation, but this time without a further output function.

This node then connected to one of the links of the output calibration layer. The nodes of the output
calibration layer had no output function.

The initialisation procedure for the calibration layers described earlier was used. This served to produce
the unknown calibration of the input and output. Because this was carried out over the 46 frames of the training set, this
led to a statistical error in this initialisation of about 15%. The test will work if the network can remove this error.

All other weights and thresholds were initialised randomly according to a rectangular distribution between
-1 and +1.

Learning  was  carried  out  using  backpropagation  with  momentum  [Rumelhart,  McClelland  1986;  Jacobs 
1988]  throughout  the  network.  A  constant  of  0.9  was  used  for  the  momentum.  The  momentum  term  was  initialised 
to  zero  in  all  cases. 

The learning rates were applied layer by layer, producing four learning rates: $r_{in}$, $r_1$, $r_2$, $r_{out}$, in that order
from input to output.
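A minimal sketch of a momentum update with one learning rate per layer is given below. The dictionary layout, the function name `momentum_step`, the toy quadratic loss and the two rates used in the demonstration are assumptions for illustration; only the momentum constant of 0.9 and the zero initial momentum term come from the text.

```python
import numpy as np

def momentum_step(weights, grads, velocities, rates, momentum=0.9):
    """One backpropagation-with-momentum update, using a separate rate per layer."""
    for name in weights:
        velocities[name] = momentum * velocities[name] - rates[name] * grads[name]
        weights[name] = weights[name] + velocities[name]

# Toy check on the loss 0.5 * w**2 per layer, whose gradient is simply w.
weights = {"calibration": np.array([1.0]), "sliding": np.array([1.0])}
velocities = {name: np.zeros(1) for name in weights}   # momentum term starts at zero
rates = {"calibration": 0.001, "sliding": 0.1}         # illustrative per-layer rates
for _ in range(200):
    grads = {name: w.copy() for name, w in weights.items()}
    momentum_step(weights, grads, velocities, rates)
```

In this toy setting the layer with the larger rate settles long before the layer with the smaller rate, which illustrates how per-layer rates let one part of the network lead while another follows.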

4  Results 

Best results were obtained with learning rates of about: $r_{in} = 0.001$, $r_1 = 0.1$, $r_2 = 0.1$, $r_{out} = 0.001$.
Figures 3 and 4 compare the learning curves for these values of the calibration learning rates against the case of no
learning in the calibration layers. These results were obtained using a further frame of data, not part of the 46
frames of the learning set, thus requiring the network to predict. Note that the values used for the other rates not
specified in the graphs are those given above.

These results show a factor of ten improvement in the network's prediction through using input calibration
learning, and a factor of two improvement using output calibration learning.

Figure 3 Effect of Input Calibration

Figure 4 Effect of Output Calibration


5  Conclusions 

The calibration learning proved to be an effective method of introducing an input-node-specific adjustment
dependent on position without reintroducing the frame size to number problem.

The values of the calibration learning rates were smaller than those of the main sliding part of the net by a
factor of 100. Any rates much higher than this produced results worse than the case of no calibration learning. It
appears that the calibration has to wait for the main network to produce a rough mapping before it can start to
converge, whilst the main network can start to converge before the calibration layers. Although, once the calibration
layers have started to learn, this should feed back to enable the main net to learn a more accurate mapping, the
learning in the calibration layers would remain subordinate to the main net. Thus this method can only be used if
either the calibration of the original signal is not too poor, or if there are enough frames (n_fr) for the initialisation
procedure used above to be valid. If this is not the case, then it would be difficult for an initial mapping to be
formed, and so neither part of the system would converge.

Because the width of the input, Δi, is larger than the width of the output (which is just 1), the network as
described above would not be capable of producing an output at the extremes of i (i ≈ 0 and i ≈ n_i). There are two
ways around this.

The simplest is just to measure more data at the ends to ensure enough input is available to cover for the
required range of the output.

If this is not an option, then the architecture can be adapted to give an output for the ends. As a starting
point take the fully trained network, trained using only one output as in figure 2. Place the network at i = Δi/2.
Connect each of the pre-ultimate layer nodes to each of the output nodes below Δi/2. No further calibration links
are needed for these connections. Train these new links using the delta rule while keeping all previously trained
links constant. Repeat this for i = -Δi/2. This extension to the architecture does not make use of the location
independence approximation (which is why no output calibration links are necessary). The number of frames
available to carry out this end-learning is therefore reduced back to n_fr. However, because these links are using
nodes trained on data from across the input vector, the information presented by these nodes represents those
features already found to be of most use to the network.

In the application to the measurement of room acoustics, the calibration layers also had the role of
normalising the data so that the main (sliding) network received the input and required output data roughly in the
range of 0 to 1 instead of the range of about -90 to -50. This presented some problems not encountered in the
experiment outlined earlier. To get a convergent behaviour from the calibration layers, the threshold adjustments had
to be scaled down on the input calibration links. Also, the momentum term had to be taken off the threshold update
rule.

Abstract

In  this  paper,  we  propose  a  Partial  Recurrent  Time  Delay  Neural  Network.  The  architecture  is 
quite  similar  to  a  Recursive  Adaptive  filter  and  can  be  trained  as  a  channel  equalizer.  Results  show 
that  the  proposed  architecture  gives  good  results  for  several  simulation  channels.  For  tests  conducted, 
it  outperforms  the  DFE  equalizer  and  the  best  available  neural  equalizer.  For  a  nonminimum  phase 
channel,  the  neural  equalizer  performs  well,  while  traditional  equalizers  do  not  perform  well. 


1  Introduction 

The problem to be considered in channel equalization is that of utilizing the information represented by
the observed channel output to produce an estimate of the input symbol x_{t-r}. A system which performs this
function is known as an equalizer; it compensates for unwanted channel features and presents the receiver with
a sequence of samples from which the effects of Intersymbol Interference (ISI) and noise have, in a sense, been
"cleaned up". Equalizers can be classified into two categories: symbol-decision and sequence-estimation
equalizers [8]. A linear transversal equalizer (LTE) is a symbol-decision equalizer, as
the operation of this equalizer at each sample t is based on the m most recent channel observations. A decision
is made regarding the transmitted symbol at sample t - r. The integers m and r are known as the equalizer
order and delay respectively. A powerful technique to improve the performance of the symbol-decision
equalizer is to include past detected symbols in its input vector; this equalizer is called a decision feedback
equalizer (DFE) [8]. The best known sequence-estimation equalizer is the maximum likelihood sequence
estimator (MLSE) [8]. The MLSE is optimal for detecting the entire transmitted sequence and provides the
best attainable performance for any equalizer [2]. High complexity and the deferring of decisions are two
drawbacks of the MLSE. Although the concept of adaptive equalization has been known for many years [9],
neural networks have only recently been used as nonlinear adaptive filters for the channel equalization
problem [6]. Some researchers have used neural networks as channel equalizers by using basic neural network
architectures and algorithms [3] [2]. The Time Delay Neural Network architecture has not been used in
channel equalization before.

In this paper, we extend the Time-Delay algorithm to a partial recurrent Time-Delay algorithm. The
new architecture and algorithm are more flexible than the original models, and can implement more powerful
functions. In order to avoid overfitting the training data, penalty functions are used. After estimating the
channel characteristics, the neural networks are used as a channel equalizer. Our neural channel equalizer
is tested on several simulated channels and compared with other neural channel equalizers and the
DFE.

2  Motivation  and  Proposed  Architecture 

In this paper, we will mainly discuss the partial recurrent network instead of the fully recurrent architecture.
We use a partially recurrent architecture since these networks can incorporate information about past states
and their learning algorithms are much simpler than those of fully recurrent networks. The architectures that we propose
are similar to the Jordan-Elman network [5] shown in Figure 1. In the partial recurrent architecture, the
input layer is divided into two parts: the true input units and the context units (the feedback units) [13].
The context units simply hold a copy of the value of state variables from the previous time step.

*The authors acknowledge support from NSF grant EET-8857711


Figure 1: Partial Recurrent Neural Network Architecture

The modifiable connections are all feedforward, and can be trained by a feedforward learning scheme. If
the state variables are delayed values of the output from the output layer, the state variables accumulate
a weighted moving average of the past values they see. The architecture can be used to implement a
non-linear Autoregressive Moving Average (ARMA) predictor [12] or a non-linear recursive adaptive equalizer.
The feedforward network is a special case of the recurrent form, which can be used to implement a non-linear
transversal adaptive equalizer [3]. The recurrency in a partial recurrent network lets the network remember
cues from the recent past, but does not complicate the training as appreciably as real-time recurrent
learning (RTRL), proposed by Williams and Zipser [5], or time-dependent recurrent back-propagation, proposed by
Pearlmutter [7].

A  Time  Delay  Neural  Network  (TDNN)  [11]  is  typically  described  as  a  layered  network  in  which  the 
outputs  of  a  layer  are  buffered  several  time  steps  and  then  fed  fully  connected  to  the  next  layer.  The 
TDNN  architecture  can  be  viewed  as  an  FIR  filter  network,  i.e.  each  connection  in  the  static  feedforward 
architecture  becomes  an  FIR  filter  [12]. 
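The FIR view of a single TDNN connection can be sketched as follows: instead of one scalar weight, the connection carries one weight per delay tap. The tap values and the function name `fir_connection` are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def fir_connection(x, taps):
    """Causal FIR filter: y[t] = sum_k taps[k] * x[t-k], with x[t] = 0 for t < 0."""
    return np.convolve(x, taps)[: len(x)]

x = np.array([1.0, 0.0, 0.0, 2.0])
taps = np.array([0.5, 0.25])       # hypothetical two-tap connection
y = fir_connection(x, taps)
print(y)                           # [0.5  0.25 0.   1.  ]
```

A static feedforward connection is the special case of a single tap.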

In a partial recurrent architecture, delayed versions of computational nodes can serve as inputs to the
network. We will consider only the case where delayed versions of output nodes are inputs to the network.
This network is a nonlinear version of a linear infinite impulse response (IIR) filter. The proposed Partial
Recurrent Time Delay Neural Network (PRTDNN) architecture is a TDNN with partial recurrent
architecture. The TDNN proposed by Waibel [11] is a special case of a PRTDNN. The PRTDNN is trained using a
variation of the backpropagation algorithm using regularization methods.
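The following sketch illustrates output feedback of this kind in a much simplified form: a single hidden layer whose input is a tapped delay line of the signal together with delayed copies of past outputs. It is not the PRTDNN of the paper (delays between layers are omitted), and the function name, weight values and tap counts are hypothetical.

```python
import numpy as np

def recurrent_run(x, W_h, b_h, w_o, b_o, n_in_taps, n_fb_taps):
    """Run a one-hidden-layer net over sequence x with delayed output feedback."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        # Tapped delay line on the input: x[t], x[t-1], ...
        u = [x[t - k] if t - k >= 0 else 0.0 for k in range(n_in_taps)]
        # Context units: copies of the past outputs y[t-1], y[t-2], ...
        c = [y[t - k] if t - k >= 0 else 0.0 for k in range(1, n_fb_taps + 1)]
        z = np.concatenate([u, c])
        h = np.tanh(W_h @ z + b_h)
        y[t] = np.tanh(w_o @ h + b_o)
    return y

impulse = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
# With feedback the response to an impulse persists (IIR-like behaviour).
y_fb = recurrent_run(impulse, np.array([[1.0, 0.5]]), np.zeros(1),
                     np.array([1.0]), 0.0, n_in_taps=1, n_fb_taps=1)
# Without feedback the response ends with the delay line (FIR-like behaviour).
y_ff = recurrent_run(impulse, np.array([[1.0]]), np.zeros(1),
                     np.array([1.0]), 0.0, n_in_taps=1, n_fb_taps=0)
print(y_fb, y_ff)
```

Only `W_h`, `b_h`, `w_o` and `b_o` would be trained; the feedback path itself is a fixed copy operation.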

The simplest weight regularization method is to use an exponential weight decay. While this method
discourages the use of large weights, the penalty for one large weight is much greater than that for many small ones. This
can be cured by using a different penalty term, such as that in [5], which normalizes the effects of different magnitude
weights by decaying larger magnitude weights more rapidly. This method is called the weight elimination
method and is used in conjunction with the backpropagation algorithm.
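The difference between the two penalty terms can be illustrated numerically. The sketch below assumes a quadratic penalty for weight decay and the saturating form (w/w0)^2 / (1 + (w/w0)^2) that is commonly associated with weight elimination; the exact penalty of [5] is not reproduced in the text, and the scale `w0` and the example weights are hypothetical.

```python
import numpy as np

def weight_decay_penalty(w):
    """Quadratic penalty: sum of squared weights."""
    return float(np.sum(w ** 2))

def weight_elimination_penalty(w, w0=1.0):
    """Saturating penalty: each weight contributes at most 1."""
    r = (w / w0) ** 2
    return float(np.sum(r / (1.0 + r)))

one_large = np.array([4.0])
many_small = np.full(16, 0.25)

for name, w in [("one large", one_large), ("many small", many_small)]:
    print(name, weight_decay_penalty(w), weight_elimination_penalty(w))
```

Under the quadratic penalty the single large weight costs sixteen times as much as the sixteen small ones, whereas under the saturating penalty the two configurations in this example cost the same.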

We note that the PRTDNN architecture has the following advantages over more conventional neural
architectures:

1. Temporal properties can be stored to improve learning ability. This is important for short sequence
reproduction tasks.

2. The learning algorithm is an extension of the backpropagation algorithm; therefore, the partial
recurrent architecture will not complicate the learning process. The feedback connections are fixed and all
modifiable weights are in feedforward connections, so backpropagation or other feedforward learning
algorithms may easily be used for training.

3. Feeding back the output will provide more information to train the network, which makes the network
remember cues from the recent past.

The drawbacks of the architecture are:

1. More connections may result in poor generalization or overtraining. Weight decay or other
pruning techniques must be used to prevent overly complex networks. Network architecture size can be
estimated following Baum [1].

2. Training is usually slow. This is because the networks are larger, with more weights and inputs to
represent time delays.

3  Channel  Equalization 

In this section, the channel equalization problem is tackled by using different neural network models,
including our proposed PRTDNN architecture. First, several typical channels are presented, then the network
architecture and parameters are discussed. Finally, comparison results are shown.

3.1  Channel  Characteristics 

The input samples are chosen from {-1, 1} with equal probability and are assumed to be independent of one
another. The additive noise samples $n_t$ are chosen independently from a Gaussian distribution with mean
0 and variance $\sigma_n^2$. The above system has been used to model a variety of communication systems, such as
HF communication channels [3]. The task of the equalizer is to recover the transmitted symbols based on
the channel observation, with the performance measure being the error probability. For easy comparison,
we use the same channels as [3] [10] [2]. The following are the z-transforms of the channels that we used:


A)  $H(z) = 0.3482 + 0.8704 z^{-1} + 0.3482 z^{-2}$    (1)

B)  $H(z) = 0.4084 + 0.8164 z^{-1} + 0.4084 z^{-2}$    (2)

C)  $H(z) = 0.7255 + 0.5804 z^{-1} + 0.3627 z^{-2} + 0.0724 z^{-3}$    (3)

Channel A is a nonminimum phase channel, channel B is a near catastrophic nonminimum phase channel
and channel C is a minimum phase channel. A channel is called catastrophic if there exist two infinite
length paths that diverge from a state (never remerging) while the squared Euclidean distance between them
remains finite [8]. Channel A has one zero outside the unit circle in the z-plane. Channel B has two zeros close
to the unit circle. All zeros of channel C lie within the unit circle.
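The statements about the zero locations can be checked numerically from the tap values in (1)-(3). The sketch below uses `numpy.roots`; the dictionary `CHANNELS` and the function name are introduced here for illustration only.

```python
import numpy as np

# Tap values h_0, h_1, ... of channels A, B and C from equations (1)-(3).
CHANNELS = {
    "A": [0.3482, 0.8704, 0.3482],
    "B": [0.4084, 0.8164, 0.4084],
    "C": [0.7255, 0.5804, 0.3627, 0.0724],
}

def channel_zeros(h):
    """Zeros of H(z) = sum_i h[i] z^-i, i.e. roots of sum_i h[i] z^(n-i)."""
    return np.roots(h)

for name, h in CHANNELS.items():
    magnitudes = np.abs(channel_zeros(h))
    energy = float(np.sum(np.square(h)))
    print(name, "zero magnitudes:", np.round(magnitudes, 4),
          "sum of squared taps:", round(energy, 4))
```

The printed sum of squared taps also gives a quick check of the normalization of each channel.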

All the channel transfer functions are normalized; that is, for the transfer function

$$H(z) = \sum_{i=0}^{n} h_i z^{-i}$$

we have

$$\sum_{i=0}^{n} h_i^2 = 1.$$

The signal to noise ratio (SNR) is then given by

$$\mathrm{SNR} = 1/\sigma_n^2$$

where $\sigma_n^2$ is the noise variance.
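A minimal simulation of this channel model might look as follows. It assumes that the SNR is quoted in dB and that, for a normalized channel driven by symbols of unit power, the noise variance is the reciprocal of the (linear) SNR; the function name and arguments are hypothetical.

```python
import numpy as np

def simulate_channel(h, n_symbols, snr_db, rng):
    """Return (transmitted symbols, noisy channel observations)."""
    x = rng.choice([-1.0, 1.0], size=n_symbols)       # equiprobable +/-1 symbols
    noise_var = 10.0 ** (-snr_db / 10.0)              # SNR = 1 / noise variance
    clean = np.convolve(x, h)[:n_symbols]             # intersymbol interference
    noise = rng.normal(0.0, np.sqrt(noise_var), size=n_symbols)
    return x, clean + noise

rng = np.random.default_rng(0)
channel_a = [0.3482, 0.8704, 0.3482]
x, y = simulate_channel(channel_a, 20000, snr_db=20.0, rng=rng)
print(x[:5], np.round(y[:5], 3))
```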

The  decision  device  is  simply  a  hard-limiter. 

Gibson [3] proposes the idea of applying neural networks to the channel equalization problem. He uses a
standard three layer feedforward neural network with the backpropagation algorithm as a training algorithm.
For minimum phase channels, the neural equalizer performs well in high noise environments. The LTE
(Linear Transversal Equalizer) works fine for high SNR. For nonminimum phase channels, the neural equalizer
outperforms the LTE because the neural equalizer can form a nonlinear decision boundary. He uses a
5-9-3-1 perceptron as an equalizer for channel A. A good method to improve the performance is to introduce
feedback. Siu [10] proposes the first DFE (Decision Feedback Equalizer) neural network based on Gibson [3].
The same channel is tested with a 4-9-3-1 network with one decision feedback. The performance improvement can
be seen clearly at high SNR (above 15 dB). He finds that the MSE decreases as the number of training samples increases.
Usually, we use 1000 training samples. These papers compare the neural equalizer with the conventional LTE
or DFE. Better performance can be achieved by MLSE (Maximum-likelihood Sequence Estimation) with
a long decision delay. Chen [2] proposes a Bayesian neural equalizer using a Radial Basis Function (RBF)
architecture. He applied the neural equalizer to channels B and C. The result shows that when the neural
equalizer and the MLSE with the Viterbi algorithm have the same decision delay, the BEPs (Bit Error Probabilities)
are comparable. The MLSE only offers superior performance when it has a long decision delay. In general
the Bayesian neural equalizer cannot achieve the performance bound set by the MLSE since it is only a
symbol-decision equalizer. Based on the above architecture, we apply the time delay neural architecture to
all three channels.

3.2  Neural  Equalizer  Structures  and  Comparison  Result 

The  network  architecture  is  selected  according  to  channel  transfer  function  and  heuristic  experimentation. 
The activation function for the PRTDNN is

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The simulation results are shown in Figures 2-4. The curves are $\log_{10}$ BEP versus SNR for each
channel. The result is quite near the MLSE bound. The BEP for an ISI-free channel is $Q(1/\sigma_n)$, $\mathrm{BEP}_{DFE} = Q(h_0/\sigma_n)$
and $\mathrm{BEP}_{MLSE} = Q(d_{min}/2\sigma_n)$, where $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\,dt$ [2] [8].
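The Gaussian tail function Q(x) and two of these reference curves can be evaluated with the standard library. The sketch reads the bounds as Q(1/sigma) for the ISI-free channel and Q(h_0/sigma) for the DFE, and takes sigma from an SNR given in dB; these readings and all function names are assumptions made for illustration.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def noise_sigma(snr_db):
    """Noise standard deviation for a normalized channel, SNR = 1 / sigma^2."""
    return math.sqrt(10.0 ** (-snr_db / 10.0))

def bep_isi_free(snr_db):
    return q_function(1.0 / noise_sigma(snr_db))

def bep_dfe(h0, snr_db):
    return q_function(h0 / noise_sigma(snr_db))

for snr_db in (5.0, 10.0, 15.0):
    # h0 = 0.3482 is the leading tap of channel A.
    print(snr_db, math.log10(bep_isi_free(snr_db)), math.log10(bep_dfe(0.3482, snr_db)))
```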

For channel A the BEP achieved by the PRTDNN is better than that of the LTE or MLP (LTE) in Gibson [3]
because of the feedback architecture. Results show that the PRTDNN is also better than that of Siu [10] at
high SNR. There is a 3 dB gain at BEP = 10~^. Compared with Chen [2], there is a 1.5 dB gain at BEP = 10~^.
The PRTDNN outperforms the other methods for channel A.

For channel B the BEPs achieved by the PRTDNN and Chen are comparable. In high SNR environments, the
PRTDNN performs slightly better. For low SNR, both of them approximate the MLSE bound. The MLSE needs
a very long delay to get good performance because of the near catastrophic channel. The conventional DFE does
not work well for the nonminimum phase channels A and B, as shown in Figures 2 and 3.

For channel C the time delay neural equalizer works better than that of Chen [2], but not as well as
the Viterbi algorithm with a long decision delay, i.e. the MLSE bound. For this channel the conventional
DFE works fairly well and usually converges faster than neural equalizers.

The simulation shows that for a minimum phase channel the conventional DFE works fine, but for a nonminimum
phase channel the neural equalizer works better than the conventional DFE. The PRTDNN not only considers
decision feedback, but also includes the temporal relationship between the current observation and those of the
recent past, which results in it outperforming other methods.

4  Summary 

In this paper, we presented a partial recurrent time delay architecture and algorithm. We have explored the use
of the proposed algorithm as applied to three channel equalization problems. We compared the performance
of the algorithm in terms of Bit Error Probability (BEP). We found that the proposed algorithm works
better for channel equalization problems than other neural equalizers. For sequence reproduction and
sequence recognition problems, of which channel equalization is an example, the partial recurrent
architecture is worthy of further investigation. This is because the architectures and algorithms are not
very complex compared to real time recurrent architectures, and test results are improved over simple
feedforward networks by introducing feedback and delay.

Further research would include optimizing parameters via regularization techniques and testing channels
with additive non-Gaussian noise and also channels that are time varying.

ABSTRACT: Image compression based on neural networks is presented with block classification and coding.
A multilayer perceptron with the error back-propagation learning algorithm is used to transform the normalized image
data into the compressed hidden values by reducing spatial redundancies. Image compression can basically be
achieved with a smaller number of hidden neurons than the number of input and output neurons. Additionally,
the image blocks can be grouped for adaptive compression ratios depending on the
complexity of the blocks. The quantized output of the hidden neurons can also be entropy coded or vector
quantized for efficient transmission. The self-organizing feature map shows better performance than vector
quantization. In computer simulation, a compression ratio of about 25:1 was achieved using entropy coding
without much degradation of the reconstructed images, and a compression ratio of about 40-45:1 using vector
quantization or the self-organizing feature map.


1.  Introduction 

Because of their massive parallelism, global operation, adaptive learning, noise robustness, and
generalization property, neural networks are a good candidate in signal processing applications where high
computational power is required [1,2]. In particular, some recent contributions of neural networks have been
reported for image data compression applications [1-4].

Multilayer neural networks with the error back-propagation learning algorithm are used to transform the
image data into the compressed data in the outputs of hidden neurons by the reduction of spatial redundancy [1].
The number of neuron units at the hidden layer is smaller than those of the input and output layers. We propose
an adaptive compression method, which classifies image blocks to compress at different ratios according to the
characteristics of the blocks, for higher compression ratios and a good generalization property. Also, coding
methods for the outputs of the hidden neurons are proposed for more compression and efficient transmission of
the compressed image data for reconstruction. Section 2 describes briefly a typical compression method that
employs two-layer neural networks and the proposed adaptive image compression/reconstruction processes. The
proposed method divides image blocks into four classes by the classification algorithm, and the quantized
outputs of the hidden neurons can be coded by entropy coding, vector quantization, or the self-organizing
feature map (SOFM). Section 3 defines the evaluation criteria and the compression ratio, and section 4 shows
computer simulation results. Finally section 5 concludes the paper.

2. Image Compression/Reconstruction by Neural Networks

The image compression/reconstruction architecture based on neural networks is shown in Fig. 1. We use
two-layer neural networks, where the hidden neurons are duplicated for data transmission. Data compression
can basically be achieved with a smaller number of hidden neurons than the numbers of input and output neurons,
which are assigned the same values of image data normalized with 8 bits during learning. All outputs of the
hidden neurons with sigmoid characteristics are quantized uniformly with 6 bits [1].

The image compression/reconstruction processes are shown in Fig. 2. We use original images that are
divided into 8x8 pixel blocks. First, the block classifier classifies the blocks of each image; then each pixel value is
normalized and input to the neural networks. At the transmission channel, hidden values of the neural
networks are coded by entropy coding, VQ, or SOFM. The processes are reversed for the reconstruction. We now
explain these processes in more detail.

Fig. 1. Neural network architecture for compression and reconstruction.

Fig. 2. Image compression/reconstruction processes.


2.1 Preprocessing stage - block classifier

The image block classifier is proposed so that compression is carried out by two-layer neural networks
with different sizes of the hidden layers according to the complexity of the blocks. The classification algorithm,
including gradient calculation and edge detection [4,5], is used with some simplification. The image blocks are
classified into four categories: the shade (class 1: no significant gradient), the complicated (class 2: definite
mixed edge), the edged (class 3: definite single edge - horizontal, vertical, or diagonal), and the midrange block
(class 4: moderate gradient, no edge).

The shade block is based on the well-known fact that intensity changes smaller than the Weber fraction T
are not visible [5]; this property was established by Weber's law through psychovisual experiments. Weber's law
states that the noticeable difference depends on intensity. So we compressed the Shade block as only one
average gray level value. The other blocks are compressed at different ratios with different numbers of hidden
neurons. Examples of the classified blocks of the Lena image are shown in Fig. 3.

2.2 Mapping stage - main compression/reconstruction

The values of 8 bit image data are normalized from -1.0 to 1.0 as input values of the neural networks. The
normalized image block is fed into the input layer on the compression side (fore stage), and the reconstructed image
block is obtained from the output layer on the reconstruction side (back stage). Class 1 blocks need no neural
network, while class 2, 3, and 4 blocks are applied to the corresponding neural networks with 8, 6, and 4
hidden neurons respectively. All output values of the hidden layers are quantized to 6 bits.
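The normalization and the 6 bit quantization can be sketched as follows. The linear mapping of 0..255 onto [-1.0, 1.0] and the assumption that hidden outputs also lie in [-1.0, 1.0] are illustrative choices (the text only says "sigmoid characteristics"), and the function names are hypothetical.

```python
import numpy as np

def normalize_pixels(block_u8):
    """Map 8 bit values 0..255 linearly onto [-1.0, 1.0]."""
    return block_u8.astype(np.float64) / 127.5 - 1.0

def quantize_uniform(values, n_bits=6, lo=-1.0, hi=1.0):
    """Uniform quantizer: integer codes in 0 .. 2**n_bits - 1."""
    levels = 2 ** n_bits - 1
    scaled = (np.clip(values, lo, hi) - lo) / (hi - lo)
    return np.rint(scaled * levels).astype(np.int64)

def dequantize_uniform(codes, n_bits=6, lo=-1.0, hi=1.0):
    """Map integer codes back to the centre values of their intervals."""
    levels = 2 ** n_bits - 1
    return codes / levels * (hi - lo) + lo

hidden = np.tanh(np.linspace(-2.0, 2.0, 8))   # stand-in for 8 hidden outputs
codes = quantize_uniform(hidden)
print(codes, np.round(dequantize_uniform(codes) - hidden, 4))
```

With 6 bits over a range of width 2 the reconstruction error of a hidden value is at most half a step, i.e. 1/63.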

2.3 Coding - entropy coding, vector quantization, and self-organizing feature map

The quantized output of the hidden neurons can be coded for more compression by entropy coding, vector
quantization or self-organizing feature map. In the case of entropy coding, the differential output values of hidden
neurons are entropy encoded to achieve lossless compression. One approach is to construct a variable-length
code, such as a modified Huffman code [6]. To encode the hidden values with a modified Huffman code that is
matched to the statistics of the differential hidden values, we define the entropy H by

Fig. 3. Examples of the classified blocks.

$$H = -\sum_{i=1}^{L} P_i \log_2 P_i \qquad (4)$$

where $P_i$ is the probability that the message will be the ith value. Since $\sum_{i=1}^{L} P_i = 1$, it can be shown
that $0 \le H \le \log_2 L$. From information theory, the entropy H in (4) is the theoretical minimum possible
average bit rate required for coding a message. Supposing the average bit rate using the codewords that we
have designed is the same as the entropy [7], we use entropy coding in simulation.
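The entropy in (4) can be estimated from the empirical frequencies of the differential values. In the sketch below the differences are taken between successive quantized codes; how the differences are formed in the original experiments is not stated, so this choice and the example data are assumptions.

```python
import numpy as np

def entropy_bits(symbols):
    """Empirical entropy H = -sum_i P_i log2 P_i of a symbol sequence, in bits."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

codes = np.array([10, 11, 11, 12, 12, 12, 13, 13])   # hypothetical 6 bit codes
diffs = np.diff(codes)                               # differential values
print(entropy_bits(codes), entropy_bits(diffs))
```

Slowly varying codes give differences concentrated on a few values, which lowers the entropy and therefore the average code length an entropy coder can approach.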

In vector quantization (VQ), the hidden neuron activations are decomposed into n-dimensional vectors.
These n x 1 hidden vectors are vector quantized according to the codebook. The LBG algorithm was used in
training the vector quantizers [8].

In the self-organizing feature map (SOFM), the hidden neuron activations of the two-layer neural networks
are the input vectors of the SOFM. Once the SOFM has been trained, the weight vectors will be organized into an
approximation of the distribution function of the input vectors. This is comparable to VQ. The hybrid structure
using MLP and SOFM is shown in Fig. 4.

2.4 Postprocessing - filter

When the compression ratio is very high, the boundaries of adjacent blocks become quite distinguishable,
which is called the blocking effect and is due to the independent coding of each block. To reduce the blocking effect one
usually filters the image after reconstruction [7]. A low pass filter is used to improve the quality of reconstructed
images and is typically applied only at or near the block boundaries to avoid unnecessary image blurring. We do not
use this postprocessing in simulation, however, so that compression performance can be compared.

As a summary, the compression processes have two coding schemes: (1) lossy coding using neural
networks, which is a quantization of hidden neuron activations, and (2) lossless coding using an entropy code
with the differences of hidden values, or lossy coding using VQ or SOFM.

Fig. 4. Hybrid structure using MLP and SOFM.

3. Evaluation Criteria and Compression Ratio

We make use of the mean-squared error (MSE) and peak signal-to-noise ratio (PSNR) as error measures of
reconstruction.

$$\mathrm{MSE} = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( x_{ij} - \hat{x}_{ij} \right)^2 \qquad (5)$$

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \qquad (6)$$

The compression ratios (CR) without and with entropy coding are measured by equations (7) and (8),
where the block size is counted in pixels, H is the number of hidden neurons, and $N_i$ and $N_h$ are the numbers of bits used in the
input and hidden layers, respectively. In vector quantization, CR is calculated from the entropy in the Shade
blocks, the code size in the vector quantizer, and the code for classifying the class. In the self-organizing feature map, CR is
calculated from the entropy in the Shade blocks, the size of the output neurons for the SOFM, and the code for classifying
the class.
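The evaluation measures of this section can be sketched as follows. The MSE and PSNR follow the usual definitions for 8 bit images (peak value 255). The compression ratio without entropy coding is assumed to be the ratio of input bits to hidden bits per block; this assumption reproduces the ratios of about 10:1 and 21:1 quoted in the simulation section for 8 and 4 hidden neurons. All function names are hypothetical.

```python
import numpy as np

def mse(original, reconstructed):
    """Mean-squared error between two images of equal shape."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr_db(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    err = mse(original, reconstructed)
    return float("inf") if err == 0.0 else float(10.0 * np.log10(peak ** 2 / err))

def compression_ratio(block_pixels, bits_in, n_hidden, bits_hidden):
    """Input bits per block divided by transmitted hidden bits per block."""
    return (block_pixels * bits_in) / (n_hidden * bits_hidden)

cr8 = compression_ratio(64, 8, 8, 6)   # 8x8 block, 8 hidden neurons, 6 bits each
cr4 = compression_ratio(64, 8, 4, 6)   # 8x8 block, 4 hidden neurons, 6 bits each
print(round(cr8, 2), round(cr4, 2))
```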


4.  Simulation  Results 

In simulation, four images with 8 bit gray levels, i.e. Lena, Bridge, Boat, and Train, are used. First we
tried the simple neural network approach without any coding, in which the networks are trained on the Lena image and
tested on the other three images. We try two cases, with 8 hidden neurons and 4 hidden neurons, which give
compression ratios of about 10:1 and 21:1, respectively. Because the Bridge image is very complicated, its PSNR is
very low.

Now let's look at the performance improvements from the features added to the presented architectures. The class
distribution for each image is shown in Table 1. For example, about 6% of the blocks are Shade blocks in the Lena
image. We note that the Bridge image is the most complicated, with about 58% Complicated blocks.

In the second experiment, the neural networks are trained on the classified blocks of the Lena image and
tested on the other three images, and the entropy is calculated from the 6 bit output values of the hidden neurons.
Shade blocks need no neural network, while the Complicated, Edged, and Midrange blocks are applied to
the corresponding neural networks with 8, 6, and 4 hidden neurons, respectively. We get a compression ratio of
about 25:1 without much degradation.

In the third experiment, the same neural networks are used, but the values of the hidden activations are
quantized using vector quantization. To learn a vector quantizer of 1024 codewords for each class, we use
the standard LBG algorithm, training the vector quantizer with the hidden values of the Lena and Bridge
images.
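
As a rough illustration of the LBG procedure named above, the following sketch grows a codebook by
repeatedly splitting every codeword and refining the result with Lloyd sweeps. It is not the code used in
the experiments: the toy two-dimensional data, the codebook size of 2 (instead of 1024), the additive
split offset `eps`, and all function names are assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lbg(data, size, eps=0.01, sweeps=20):
    """Train a codebook of `size` codewords (a power of two) by splitting and Lloyd sweeps."""
    codebook = [centroid(data)]
    while len(codebook) < size:
        # Split: replace each codeword by two slightly perturbed copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(sweeps):
            cells = [[] for _ in codebook]
            for x in data:
                k = min(range(len(codebook)), key=lambda j: sq_dist(codebook[j], x))
                cells[k].append(x)
            # Move each codeword to the centroid of its cell; keep it if the cell is empty.
            codebook = [centroid(cell) if cell else cw for cell, cw in zip(cells, codebook)]
    return codebook


# Toy stand-in for hidden-activation vectors: two well separated clusters.
hidden_values = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codebook = sorted(lbg(hidden_values, 2))
print(codebook)
```

Each hidden-activation vector would then be transmitted as the index of its nearest codeword.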

In the fourth experiment, we use SOFM instead of VQ. The size of the input vectors is the same as for VQ.
The number of output neurons is 1024. Simulation results are shown in Table 2.


The major advantage of this approach is its good performance for untrained images, and image compression
using the block classifier and coding is more effective at high compression ratios. All the simulation
results for the Lena image are summarized in Fig. 5. N is neural networks, E is entropy coding, B is block
classifier, V is vector quantization, and SOFM is self-organizing map. In this figure, JPEG (Joint
Photographic Experts Group) [10] is very good below a 30:1 compression ratio, but its performance drops
very rapidly for higher compression. Interpolative/Residual VQ (I/RVQ) [10] can achieve a higher
compression ratio, but its PSNR is not good. Although the performance of the simple neural network
approach is very limited, better performance may be achieved by block classification and coding based on
vector quantization and SOFM.
6. Conclusion


In this paper, we present a new method of image compression and reconstruction using neural networks,
block classification, entropy coding, VQ, and SOFM. We obtained about a 25:1 compression ratio without
much degradation of the reconstructed images with entropy coding, and about a 40-45:1 compression ratio
with some degradation with VQ or SOFM. We also propose a hybrid model with MLP and SOFM, which shows good
performance in the high compression ratio region. As future work, we are concerned with color image
compression and video coding using neural networks.
Abstract

We study the convergence of the Least Mean Squares learning algorithm in self-referential linear
stochastic models when agents form their expectations according to a misspecified version of the model
itself. The law of motion perceived by the agents influences the actual law of motion of the model. In
this framework, agents engage in a learning activity by which they update their estimates. The so-called
rational expectations equilibrium is obtained if agents form their expectations consistently with the
“true” model. We assume that agents update their estimates by a “modified” Least Mean Squares learning
algorithm. The convergence of this algorithm to the rational expectations equilibria of the model is
analyzed and a general convergence result is obtained. The point to which the algorithm converges depends
on the strength of the noise signal which affects the model and on the characteristics of the function
which weighs the noise signal itself. The main difference with respect to similar studies of the
convergence of learning mechanisms to rational expectations equilibria in self-referential linear
stochastic models lies in the algorithm, which is not the Ordinary Least Squares usually adopted in the
literature.

1  Introduction 

In this paper, we address an interesting problem in economic theory: the interaction between the evolution
of a Self-Referential Linear Stochastic (SRLS) model dependent on agents’ beliefs and a learning process
for the agents based on a misspecified version of the model itself.

If the agents know the “true” model, then the Rational Expectations Equilibrium (REE) is the solution of
the system. At the REE, agents use their private information optimally and consistently with the “true”
model, and the expectation errors, conditioned on the available information set, have zero mean. If the
agents form their expectations by a misspecified model, then the expectation errors do not have zero mean
and a learning activity takes place.

In the literature [1], it has been assumed that agents believe in a linear model, characterized by the
parameter matrix B, and that they update their estimates by means of the Recursive Ordinary Least Squares
(ROLS) algorithm. In [1] it has been proved that, with the ROLS mechanism, the convergence is always to
the REE point, but that it is guaranteed only for a restricted set of functions and parameters of the
model.

In this study, we assume that agents update their estimates by a “modified” version of the Least Mean
Squares (LMS) algorithm [2]. We analyze the convergence of the LMS algorithm for a general class of SRLS
models. Inside this framework, it is necessary to modify the “classical” LMS algorithm by using a
decreasing learning factor. Convergence is proved by deriving the ordinary differential equation
associated with the “modified” LMS updating rule, see [3]. The convergence is in probability. We prove
that the “modified” LMS algorithm converges to an equilibrium point of the differential equation, but that
such a point can differ from the REE according both to the strength of the noise signal and to the type of
function which weighs the noise signal itself inside the SRLS model.

The  paper  is  organized  as  follows.  In  section  2,  SRLS  models  are  briefly  described.  In  section  3,  the  LMS 
algorithm  is  applied  to  this  framework.  Then,  convergence  results  are  discussed. 
2 Self-referential linear stochastic models

Following [1], we denote by two subvectors of $z_t \in \mathbb{R}^{n}$, not necessarily disjoint, the set
of variables that the agents are interested in, i.e., $z_{1t} \in \mathbb{R}^{n_1}$, and the set of
variables that the agents think are relevant to predict the variables in $z_{1t}$, i.e.,
$z_{2t} \in \mathbb{R}^{n_2}$.

In the literature, the linear law of motion for $z_{1t}$ perceived by the agents at time t is usually
described by:

\[ z_{1t} = B_t^T z_{2(t-1)} + \rho_t \qquad (1) \]

where $B_t \in \mathbb{R}^{n_2 \times n_1}$ is the parameter matrix representing the perceived law of
motion of $z_{1t}$ and $\rho_t \in \mathbb{R}^{n_1}$ is a noise vector. The agents’ beliefs in (1) cause
the actual law of motion for the entire vector $z_t$ to be given, in a general setting, by:


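\[
z_t = \begin{bmatrix} z_{1t} \\ z_{1t}^{c} \end{bmatrix}
    = \begin{bmatrix} 0 & T(B_t)^T \\ \multicolumn{2}{c}{A(B_t)^T} \end{bmatrix}
      \begin{bmatrix} z_{2(t-1)}^{c} \\ z_{2(t-1)} \end{bmatrix}
    + \begin{bmatrix} V(B_t)^T \\ B(B_t)^T \end{bmatrix} u_t
\qquad (2)
\]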

where the superscript c expresses the complement with respect to $z_t$, $u_t \in \mathbb{R}^{n}$ is a
stationary white noise, and $T(B_t)$ is the application which, given $B_t$, describes the actual law of
motion for $z_{1t}$ at time t, i.e., $T : B_t \to T(B_t) \in \mathbb{R}^{n_2 \times n_1}$. The function
$V(B_t)$ modulates the noise term in the SRLS model according to the agents’ estimations represented by
$B_t$, $V : B_t \to V(B_t) \in \mathbb{R}^{n \times n_1}$. The other applications are defined as:

$A : B_t \to A(B_t)$, $B : B_t \to B(B_t)$

Note that the agents’ estimation, which is represented by $B_t$ in (1), defines, together with the
features of the model, the actual law of motion in (2).

A  REE  of  the  SRLS  model  in  (2)  is  a  fixed  point  of  the  application  T(B),  i.e.,  B*  such  that  T(B*)  =  B* . 
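
To make the fixed-point definition concrete, the sketch below finds the REE of a scalar toy model by
direct iteration. The map $T(b) = a + c\,b$ and its coefficients are hypothetical and chosen only for
illustration; with $|c| < 1$ the iteration contracts to $b^* = a / (1 - c)$.

```python
def T(b, a=1.0, c=0.5):
    """Hypothetical scalar map from perceived to actual law of motion."""
    return a + c * b


def fixed_point(f, b0, tol=1e-12, max_iter=1000):
    """Iterate b <- f(b) until successive values agree to within tol."""
    b = b0
    for _ in range(max_iter):
        b_next = f(b)
        if abs(b_next - b) < tol:
            return b_next
        b = b_next
    raise RuntimeError("fixed-point iteration did not converge")


b_star = fixed_point(T, 0.0)
print(b_star, T(b_star))   # the two values agree at the REE
```

Direct iteration is used here only because the toy map is a contraction; it is not a learning rule of the
kind studied in this paper.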

The data generating process in (2) does not imply that $z_t$ is a stationary stochastic process. Although
the LMS learning algorithm has been applied also in non-stationary environments [4], we restrict our study
to the stationary case. Let us define the set $D_s$ where the SRLS model (2) is a stationary stochastic
process.

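\[
D_s = \left\{ B \in \mathbb{R}^{n_2 \times n_1} \;\middle|\; \text{the eigenvalues } \lambda_j \text{ of }
\begin{bmatrix} 0 & T(B)^T \\ \multicolumn{2}{c}{A(B)^T} \end{bmatrix}
\text{ are less than unity in absolute value, i.e., } |\lambda_j| < 1 \right\}
\]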
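
Membership in $D_s$ can be checked numerically for a given B by computing the eigenvalues of the
transition matrix. The sketch below does this for a 2 x 2 case using the characteristic polynomial; the
two matrices are hypothetical stand-ins for the block matrix above evaluated at some fixed B, and the
function names are assumptions of the example.

```python
import cmath


def eigenvalues_2x2(m):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)   # complex square root handles any sign
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def is_stationary(m):
    """True when every eigenvalue lies strictly inside the unit circle."""
    return all(abs(lam) < 1.0 for lam in eigenvalues_2x2(m))


print(is_stationary([[0.0, 0.5], [0.3, 0.4]]))   # eigenvalues inside the unit circle
print(is_stationary([[0.0, 1.2], [1.0, 0.5]]))   # one eigenvalue outside the unit circle
```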

In the literature, the algorithm used to update the estimates of the parameter B is the ROLS, see [1].
Convergence of this algorithm to the REE of the SRLS model has been analyzed using the ordinary
differential equation associated with the ROLS updating rule, following [3]. It has been proved that
agents succeed in reaching the REE of (2) when the ordinary differential equation is stable, see [1]. Let
us remark that the ROLS algorithm is sensitive neither to the noise term $u_t$ nor to the type of function
$V(B)$ which weighs the noise signal itself.

In  the  following,  a  similar  procedure  will  be  used  to  prove  the  convergence  of  the  “modified”  LMS 
algorithm. 

3  The  Least  Mean  Squares  algorithm  in  self-referential  linear 
stochastic  models 

The ROLS mechanism has been used to update agent estimates because it guarantees good performance, being
the “Best Linear Unbiased Estimator”. However, the agents’ model is misspecified inside the SRLS
framework. An alternative learning procedure, deriving from the engineering literature on adaptive signal
processing, is the LMS algorithm [2]. In order to apply this mechanism to the SRLS framework and to use it
as a learning mechanism for the agents to update the parameters B, it is necessary to define how agents
estimate the vector of variables $z_{1t}$.

We  define  the  following  linear  perceived  law  of  motion  for  the  agents: 

\[ \hat{z}_{1t} = B_t^T z_{2(t-1)} \qquad (3) \]
Let us remark that (3) represents a deterministic perceived law of motion and does not include a noise
component as (1) does. However, a stochastic component in (3) does not affect the convergence analysis
carried out below.

In order to apply the LMS algorithm to the SRLS model, the instantaneous error
$\varepsilon_t \in \mathbb{R}^{n_1}$ shall be computed, i.e., $\varepsilon_t = z_{1t} - \hat{z}_{1t}$,
where:

• $\hat{z}_{1t}$ is the forecasted value of $z_{1t}$ by means of (3),

• $z_{1t}$ is the actual value generated by the law of motion (2) as
$z_{1t} = T(B_t)^T z_{2(t-1)} + V(B_t)^T u_t$.

Let us define $\beta_i \in \mathbb{R}^{n_2}$ as the i-th column of the parameter matrix B. Let us express
the functions $T(\cdot)$ and $V(\cdot)$ when evaluated at $\beta_i$, for $i = 1, \dots, n_1$:
$T_i : \beta_i \to T_i(\beta_i) \in \mathbb{R}^{n_2}$ is the i-th column vector of the matrix $T(B)$ and
$V_i : \beta_i \to V_i(\beta_i) \in \mathbb{R}^{n}$ is the i-th column vector of the matrix $V(B)$.

The LMS algorithm updates the parameter matrix B to minimize the Mean Square Error (MSE) related to the
SRLS model in (2). The MSE is given by the mean value, taken over time, of the square of the instantaneous
error $\varepsilon_t$. Let us fix the parameter matrix, i.e., drop the subscript t. Because it has been
assumed that $E\{\varepsilon_{t_i}\varepsilon_{t_j}\} = 0$ for $i \neq j$, the MSE, expressed by
$\xi(B)$, has its i-th component equal to
$\xi_i(\beta_i) = E\{\varepsilon_{t_i}^2\} = E\{(z_{1t_i} - \hat{z}_{1t_i})^2\}$. Using the agents’
estimations in (3), the i-th MSE is expressed as

\[
\xi_i(\beta_i) = E\{\varepsilon_{t_i}^2\}
 = E\left\{ \left[ (T_i(\beta_i) - \beta_i)^T z_{2(t-1)} + V_i(\beta_i)^T u_t \right]^2 \right\},
 \quad i = 1, \dots, n_1 .
\]


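Let us introduce the matrices
$M_{z_2 t}(\beta_i) = E\{z_{2(t-1)} z_{2(t-1)}^T\} \in \mathbb{R}^{n_2 \times n_2}$,
$M_{u_t} = E\{u_t u_t^T\} \in \mathbb{R}^{n \times n}$ and
$C_t(\beta_i) = E\{z_{2(t-1)} u_t^T\} \in \mathbb{R}^{n_2 \times n}$. Because of the restriction to
stationary stochastic processes, the statistics of the processes $z_t$ and $u_t$ are time invariant, i.e.,
$M_{z_2 t}(\beta_i) = M_{z_2}(\beta_i)$, $M_{u_t} = M_u$ and $C_t(\beta_i) = C(\beta_i)$ for
$B \in D_s$. Note that the assumption $C(\beta_i) = 0$, usually made in economics, implies that the MSE
function is given by the sum of two quadratic forms:

\[
\xi_i(\beta_i) = (T_i(\beta_i) - \beta_i)^T M_{z_2}(\beta_i)\, (T_i(\beta_i) - \beta_i)
 + V_i(\beta_i)^T M_u V_i(\beta_i), \quad i = 1, \dots, n_1 \qquad (4)
\]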


Depending on the functions $T(\cdot)$ and $V(\cdot)$, the function $\xi_i(\beta_i)$ can have more than one
global maximum/minimum, local maximum/minimum and saddle points. Note that a REE is a minimum for the
first quadratic form in (4) but is not necessarily a minimum of the MSE function.

We have to remark that the LMS algorithm looks for minima and saddle points of the MSE function, which
need not be global minima. In order to compute the stationary points of the MSE function, let us evaluate
the gradient of the i-th MSE component. The gradient of the i-th MSE function is:

\[
\nabla \xi_i(\beta_i) = 2 \left( \frac{\partial T_i}{\partial \beta_i} - I \right)^T M_{z_2}(\beta_i)\,
 (T_i(\beta_i) - \beta_i)
 + 2 \left( \frac{\partial V_i}{\partial \beta_i} \right)^T M_u V_i(\beta_i), \quad i = 1, \dots, n_1 .
\]


As a result, $\beta_i^*$ such that $\beta_i^* = T_i(\beta_i^*)$ is not necessarily a critical point of the
i-th MSE surface.

If the noise component of the model $u_t$, which agents do not take into account in their signal
extraction activity, enters the model directly and is weighed by a constant function $V(B)$ which does not
depend on their beliefs, then a REE is a minimum of the MSE. Otherwise, if the noise term enters the SRLS
model weighed by the function $V(B)$, i.e., $V(B)^T u_t$, which depends on the agents’ estimate B, then
the minimum of the MSE, depending on the noise component, is different from the REE. In this case, the LMS
learning mechanism is not able to reach the REE but deviates to a non-REE.
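
The two cases can be illustrated with a scalar version of the quadratic MSE (4). The sketch below assumes
hypothetical linear maps $T(b) = a + c\,b$ and $V(b) = v_0 + v_1 b$ and treats the second moments $M_{z_2}$
and $M_u$ as constants; none of these choices comes from the model above. Setting the derivative of
$\xi(b) = M_{z_2}(T(b) - b)^2 + M_u V(b)^2$ to zero gives the minimizer in closed form.

```python
def mse(b, a, c, v0, v1, m_z=1.0, m_u=1.0):
    """Scalar MSE: signal quadratic form plus noise quadratic form."""
    return m_z * (a + c * b - b) ** 2 + m_u * (v0 + v1 * b) ** 2


def mse_minimizer(a, c, v0, v1, m_z=1.0, m_u=1.0):
    """Closed-form minimizer of mse(b) for linear T and V (m_z held constant)."""
    num = (c - 1.0) * m_z * a + v1 * m_u * v0
    den = (c - 1.0) ** 2 * m_z + v1 ** 2 * m_u
    return -num / den


ree = 1.0 / (1.0 - 0.5)                                    # fixed point of T(b) = 1 + 0.5 b
constant_v = mse_minimizer(a=1.0, c=0.5, v0=0.5, v1=0.0)   # V independent of beliefs
belief_v = mse_minimizer(a=1.0, c=0.5, v0=0.5, v1=0.5)     # V depends on beliefs
print(ree, constant_v, belief_v)
```

With a constant noise weight the minimizer coincides with the REE; once the weight depends on the belief
parameter, the minimizer moves away from it.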

Let us remark that the LMS algorithm always converges to a point that can be either a non-REE or a REE,
depending on the net noise signal. On the contrary, the ROLS algorithm can also diverge regardless of the
characteristics of the net noise signal.


In [5], we have proved the convergence on average of the LMS algorithm to the REE when applied to SRLS
models where $T(\cdot)$ is a linear function of B and $V(\cdot)$ is independent of B. To analyze the
convergence of the LMS learning algorithm for a more general class of SRLS models, we have to restrict our
attention to convergence in probability by applying the framework introduced by Ljung [3]. Moreover, the
“classical” LMS algorithm is modified by assuming a decreasing learning factor $\gamma_t$ instead of a
fixed $\eta$.

In  the  following,  the  reasons  that  cause  the  “classical”  LMS  algorithm  not  to  converge  on  average  in 
a  more  general  class  of  SRLS  models,  are  briefly  sketched.  Then,  the  LMS  algorithm  is  modified  and  the 
convergence  to  the  minimum  of  the  MSE  is  proved. 
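Let us recall the LMS expression used to update the i-th parameter column [2],

\[
\beta_{(t+1)_i} = \beta_{t_i} - 2 \eta\, \varepsilon_{t_i}
 \frac{\partial \varepsilon_{t_i}}{\partial \beta_{t_i}}, \quad i = 1, \dots, n_1 \qquad (5)
\]

where $\eta \in \mathbb{R}$ is a positive constant, $\varepsilon_{t_i} \in \mathbb{R}$ is the
instantaneous error, i.e., $\varepsilon_{t_i} = z_{1t_i} - \hat{z}_{1t_i}$, and
$\partial \varepsilon_{t_i} / \partial \beta_{t_i} \in \mathbb{R}^{n_2}$. Recalling the SRLS model in (2),
the derivative of the instantaneous error is given by:

\[
\frac{\partial \varepsilon_{t_i}}{\partial \beta_{t_i}}
 = \left( \frac{\partial T_i(\beta_{t_i})}{\partial \beta_{t_i}} - I \right)^T z_{2(t-1)}
 + \left( \frac{\partial V_i(\beta_{t_i})}{\partial \beta_{t_i}} \right)^T u_t,
 \quad i = 1, \dots, n_1 .
\]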


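As a result, the “classical” LMS updating rule for the SRLS model in (2) is given by:

\[
\begin{aligned}
\beta_{(t+1)_i} = \beta_{t_i}
 &- 2 \eta\, (T_i(\beta_{t_i}) - \beta_{t_i})^T z_{2(t-1)}
   \left[ \left( \frac{\partial T_i(\beta_{t_i})}{\partial \beta_{t_i}} - I \right)^T z_{2(t-1)}
   + \left( \frac{\partial V_i(\beta_{t_i})}{\partial \beta_{t_i}} \right)^T u_t \right] \\
 &- 2 \eta\, V_i(\beta_{t_i})^T u_t
   \left[ \left( \frac{\partial T_i(\beta_{t_i})}{\partial \beta_{t_i}} - I \right)^T z_{2(t-1)}
   + \left( \frac{\partial V_i(\beta_{t_i})}{\partial \beta_{t_i}} \right)^T u_t \right],
   \quad i = 1, \dots, n_1 \qquad (6)
\end{aligned}
\]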


The rule in (6) updates the parameter $\beta_{t_i}$ according to two terms: one includes the product
$(T_i(\beta_{t_i}) - \beta_{t_i})^T z_{2(t-1)}$ and the other includes $V_i(\beta_{t_i})^T u_t$. Let us
subtract from both sides of (6) the value of the REE point $\beta_i^*$. Let us take the average of the
resulting terms, assuming, for example, that both $V(\cdot)$ and $T(\cdot)$ are linear functions of B, so
that their derivatives are constants, and assuming that B is independent of $z_2$ and $u_t$.


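\[
E\{\beta_{(t+1)_i} - \beta_i^*\} = E\{\beta_{t_i} - \beta_i^*\}
 - 2 \eta \left( \frac{\partial T_i}{\partial \beta_i} - I \right)^T M_{z_2}\,
   E\{T_i(\beta_{t_i}) - \beta_{t_i}\}
 - 2 \eta \left( \frac{\partial V_i}{\partial \beta_i} \right)^T M_u\, E\{V_i(\beta_{t_i})\}
\]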


Let us analyze this expression as $t \to \infty$. It can be noted that two series arise. The first one,
originating from the signal $z_{2t}$, vanishes as the parameters converge to the REE, while the second
one, which includes the noise term, does not vanish but takes the parameter B away from the REE, so that
the overall process diverges, on average, from the REE.
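
The drift described above can be reproduced with the averaged recursion of a scalar toy model. The sketch
assumes hypothetical linear maps $T(b) = a + c\,b$ and $V(b) = v_0 + v_1 b$, constant second moments, and
independence of the parameter from $z_2$ and $u_t$, so that the expectation passes through the update; all
names and coefficients are illustrative.

```python
def mean_drift(m0, eta, steps, a=1.0, c=0.5, v0=0.5, v1=0.5, m_z=1.0, m_u=1.0):
    """Iterate the averaged constant-step LMS recursion for the mean of the parameter."""
    m = m0
    for _ in range(steps):
        signal_term = (c - 1.0) * m_z * (a + (c - 1.0) * m)   # vanishes at the REE
        noise_term = v1 * m_u * (v0 + v1 * m)                 # present whenever v1 != 0
        m -= 2.0 * eta * (signal_term + noise_term)
    return m


ree = 1.0 / (1.0 - 0.5)
print(mean_drift(ree, eta=0.05, steps=2000))           # belief-dependent noise weight
print(mean_drift(ree, eta=0.05, steps=2000, v1=0.0))   # constant noise weight
```

Started at the REE, the mean stays there when the noise weight is constant, and is pulled away by the
noise term when the weight depends on the parameter.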

In order to use the LMS algorithm as the agents’ learning mechanism in SRLS models where the function
$V(\cdot)$ is dependent on B, it is necessary to replace the “classical” LMS updating rule, which assumes
$\eta$ to be a positive constant, with the following “modified” LMS rule, where $\eta$ has been
substituted with $\gamma_t$, which is a decreasing function as $t \to \infty$. The decreasing positive
values given by $\gamma_t$ are used to reject the noisy observations. Let us modify the LMS learning
algorithm, defining the parameter matrix $B = \{\beta_i, i = 1, \dots, n_1\}$, as follows:

\[
\beta_{(t+1)_i} = \beta_{t_i} - 2 \gamma_{t+1}\, \varepsilon_{t_i}
 \frac{\partial \varepsilon_{t_i}}{\partial \beta_{t_i}}, \quad i = 1, \dots, n_1 . \qquad (7)
\]

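As a consequence, the “modified” LMS algorithm in (7), when applied to the SRLS model in (2), reduces to
the following updating rule:

\[
\begin{aligned}
\beta_{(t+1)_i} = \beta_{t_i} - 2 \gamma_{t+1} \Big[
 & \left( \tfrac{\partial T_i(\beta_{t_i})}{\partial \beta_{t_i}} - I \right)^T
   z_{2(t-1)} z_{2(t-1)}^T\, (T_i(\beta_{t_i}) - \beta_{t_i})
 + \left( \tfrac{\partial T_i(\beta_{t_i})}{\partial \beta_{t_i}} - I \right)^T
   z_{2(t-1)} u_t^T\, V_i(\beta_{t_i}) \\
 & + \left( \tfrac{\partial V_i(\beta_{t_i})}{\partial \beta_{t_i}} \right)^T
   u_t z_{2(t-1)}^T\, (T_i(\beta_{t_i}) - \beta_{t_i})
 + \left( \tfrac{\partial V_i(\beta_{t_i})}{\partial \beta_{t_i}} \right)^T
   u_t u_t^T\, V_i(\beta_{t_i}) \Big] \qquad (8)
\end{aligned}
\]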


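According to [3], let the set $D_2$ be closed and $D_1$ open and bounded, with
$D_2, D_1 \subset \mathbb{R}^{n_2 \times n_1}$ and $D_2 \subset D_1 \subset D_s$. The final algorithm for
generating beliefs $B_t = \{\beta_{t_i}, i = 1, \dots, n_1\}$ is:

\[
B_{t+1} = \begin{cases}
 \bar{B}_{t+1} & \text{if } \bar{B}_{t+1} \in D_1 \\
 \text{some value in } D_2 & \text{if } \bar{B}_{t+1} \notin D_1
\end{cases} \qquad (9)
\]

where $\bar{B}_{t+1}$ denotes the update produced by (8).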


The most natural candidate for “some value in $D_2$” is $B_{t'}$, where $t'$ is the last time that the
parameter $B \in D_2$, but any other point in $D_2$ is acceptable. When
$D_2 = D_1 = \mathbb{R}^{n_2 \times n_1}$, then $\bar{B}_t = B_t$ for all t.
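
A minimal simulation of the “modified” rule with the projection facility, for a scalar toy model, is
sketched below. Everything specific in it is an assumption made for illustration: the linear maps
$T(b) = A + C\,b$ and $V(b) = V_0 + V_1 b$, an exogenous standard normal regressor, the schedule
$\gamma_t = 1/(t + 10)$, and the intervals standing in for $D_1$ and $D_2$.

```python
import random

A, C = 1.0, 0.5        # hypothetical T(b) = A + C*b, so the REE is A / (1 - C) = 2
V0, V1 = 0.5, 0.5      # hypothetical V(b) = V0 + V1*b, noise weight depends on beliefs
D1 = (-5.0, 5.0)       # open, bounded interval standing in for D_1
D2 = (-4.0, 4.0)       # closed interval standing in for D_2, contained in D_1


def modified_lms(steps, seed=0, b0=2.0):
    """Decreasing-gain LMS with a projection facility; returns the final estimate."""
    rng = random.Random(seed)
    b = last_in_d2 = b0
    for t in range(1, steps + 1):
        z2 = rng.gauss(0.0, 1.0)                          # exogenous regressor
        u = rng.gauss(0.0, 1.0)                           # white noise
        err = (A + C * b - b) * z2 + (V0 + V1 * b) * u    # actual value minus forecast
        d_err = (C - 1.0) * z2 + V1 * u                   # derivative of err w.r.t. b
        gamma = 1.0 / (t + 10.0)                          # decreasing learning factor
        candidate = b - 2.0 * gamma * err * d_err         # unprojected update
        if D1[0] < candidate < D1[1]:
            b = candidate
        else:
            b = last_in_d2                                # projection: fall back into D_2
        if D2[0] <= b <= D2[1]:
            last_in_d2 = b
    return b


estimate = modified_lms(50000)
print(estimate)
```

For this toy model the minimum of the scalar MSE is at 0.5 rather than at the REE value of 2, and the
estimate is expected to settle near that minimum.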
Besides the decreasing learning factor $\gamma_t$, the “modified” algorithm defined by (8-9) deviates from
that in (6) because it invokes a “projection facility” (9) that prevents the estimator from ever leaving
the set determined by $D_1$. In this way, the observations that drive B outside of $D_1$ are ignored. The
“projection facility” is used so as to verify more easily the hypotheses used in Ljung’s framework, see
[3].

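In order to study the convergence of this algorithm, we apply the method suggested by Ljung in [3].
Accordingly, we compute the differential equation associated with (8-9). Using the matrices $M_{z_2}$ and
$M_u$, it is given by:

\[
\frac{d\beta_i}{d\tau} = -2 \left[
 \left( \frac{\partial T_i}{\partial \beta_i} - I \right)^T M_{z_2}(\beta_i)\, (T_i(\beta_i) - \beta_i)
 + \left( \frac{\partial V_i}{\partial \beta_i} \right)^T M_u V_i(\beta_i) \right],
 \quad i = 1, \dots, n_1 . \qquad (10)
\]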

Let us define the set $D_A$ as the “domain of attraction” of the equilibrium point of the differential
equation in (10) that is a minimum point ($B^{min}$) of the MSE. In order to prove the convergence of the
“modified” LMS learning mechanism in (8-9) by means of the associated differential equation in (10), the
following assumptions on the model in (2) are employed. These conditions have been stated following the
guidelines in [1] so as to satisfy the hypotheses of Theorem 1 in [3] on the convergence of recursive
algorithms.

A.1 The ordinary differential equation has a unique fixed point $B^{min}$,

A.2 $T(\cdot)$ and $V(\cdot)$ are twice differentiable, and $A(\cdot)$, $B(\cdot)$ have one derivative in
$D_s$,

A.3 the covariance matrices $M_{z_2}$ and $M_u$ are nonsingular,

A.4 for all t, $\gamma_t > 0$ and $\gamma_t \to 0$ as $t \to \infty$,

A.5 the vector $u_t$ consists of n stationary random variables, and $u_t$ is serially independent.
Further, $E\{|u_{t_i}|^p\} < \infty$ for all $p > 1$, for $i = 1, \dots, n$.

A.6 Suppose that there exists an event $\Omega_0$ with $P(\Omega_0) = 1$ such that for each
$\omega \in \Omega_0$ there is one random variable $C_1(\omega)$ and a subsequence $\{t_k(\omega)\}$ such
that, for all $t_k(\omega)$, $|z_{2 t_k}| < C_1(\omega)$.

A.7 Let $D_2$ be a closed set and $D_1$ be an open and bounded set, with $D_2 \subset D_1 \subset D_s$.
Assume that the trajectories of the ordinary differential equation with initial condition
$B_0 = \{\beta_{0_i}, i = 1, \dots, n_1\} \in D_2$ never leave a closed subset of $D_1$.

Assumption A.1 is made solely to simplify the demonstration. From [3], it is clear that our results can be
extended to a model with multiple minima of the MSE.

Proposition 3.1 Let B be given by the learning mechanism in (8-9). Let $B^{min}$ be the unique point of
attraction of the ordinary differential equation in (10) and let $D_A$ be the domain of attraction of
$B^{min}$. Let the initial condition $B_0$ be in $D_2$. Assume A.1 to A.7 are satisfied. If

$D_1 \subset D_A$, so that $D_2 \subset D_1 \subset (D_s \cap D_A)$,

then $B_t \to B^{min}$ with probability one as $t \to \infty$.

Proof. In order to prove the proposition above, we use Theorem 1 in [3]. Moreover, let us refer to the
assumptions B.1-B.11 in [3]. Note that:

• assumptions B.1 and B.2 in [3] are implied by our A.5,

• assumptions B.3, B.4 and B.5 in [3] are implied by the smoothness assumptions on $T(\cdot)$,
$A(\cdot)$, $B(\cdot)$ and $V(\cdot)$ in our A.2,

• assumption B.6 in [3] is satisfied because the following limits exist:

\[
\lim_{t \to \infty} E\left\{ \left[ (T_i(\beta_i) - \beta_i)^T z_{2(t-1)} + V_i(\beta_i)^T u_t \right]
 z_{2(t-1)} \right\} = M_{z_2}(\beta_i)\, (T_i(\beta_i) - \beta_i)
\]

\[
\lim_{t \to \infty} E\left\{ \left[ (T_i(\beta_i) - \beta_i)^T z_{2(t-1)} + V_i(\beta_i)^T u_t \right]
 u_t \right\} = M_u V_i(\beta_i)
\]

where $\beta_i$ is fixed and $z_{2t}$ is evaluated for a solution of the difference equation.
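• Assumption B.7 is implied by our A.5 and A.2, where the Lipschitz constants $K_1$ and $K_2$ in [3] are,
respectively, the norms of the first and second derivatives of

\[
\left( \tfrac{\partial T_i}{\partial \beta_i} - I \right)^T z_{2(t-1)} z_{2(t-1)}^T\,
 (T_i(\beta_{t_i}) - \beta_{t_i})
 + \left( \tfrac{\partial T_i}{\partial \beta_i} - I \right)^T z_{2(t-1)} u_t^T\, V_i(\beta_{t_i})
\]

\[
 + \left( \tfrac{\partial V_i}{\partial \beta_i} \right)^T u_t z_{2(t-1)}^T\,
 (T_i(\beta_{t_i}) - \beta_{t_i})
 + \left( \tfrac{\partial V_i}{\partial \beta_i} \right)^T u_t u_t^T\, V_i(\beta_{t_i})
\]

with respect to $\beta_i$, and

• Finally, assumptions B.8 to B.11 are implied by our A.4.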

Finally, if A.7 is satisfied, then since $z_{1t} = T(B_t)^T z_{2(t-1)} + V(B_t)^T u_t$, it follows that
there exists a subsequence $\{t_k\}$ such that $|z_{1 t_k}|$, $|z_{2 t_k}|$ and $B_{t_k}$ are bounded
along this subsequence; therefore (20) in [3] is satisfied and we can apply Theorem 1, which states the
convergence of the solution of the differential equation associated with a general recursive stochastic
algorithm. □

Note that Proposition 3.1 does not cover the case in which, at some points on the boundary of $D_1$,
trajectories of (10) point away from the interior of $D_1$. Trajectories that would leave $D_1$ are not
allowed to do so by virtue of the “projection facility” in (9). In order to account for this fact, the
following corollary is stated.

Corollary 3.1 Assume that A.1-A.6 are satisfied, $\beta_i \in D_1 \subset D_s$, and that $D_1$ is open and
bounded. Assume that $D_1 \subset D_A$. Given that:

$P_1 = P(\beta_i \to \beta_i^{min})$

$P_2 = P(\beta_i \to (D_1 - D_2)$ for a subsequence $\{t_k(\omega)\})$

then $P_1 + P_2 = 1$.

Proof. Elementary. □

There is an important class of models in which it is possible to verify analytically the required
behaviour of the trajectories of the ordinary differential equation (10) at the boundary of $D_1$. These
models are those for which $z_2$ is exogenous in the sense that $A(\cdot)$ and $B(\cdot)$ in (2) are
independent of B. Note that in this case $M_{z_2}(\beta_i) = M_{z_2}$.

4  Conclusions 

We have studied the convergence of a “modified” version of the LMS algorithm to the REE of SRLS models
when agents believe in a misspecified model. The analysis has been carried out by associating the
ordinary differential equation with the “modified” LMS updating rule. This algorithm has been proved to
converge to a point which can differ from the REE. This depends on the strength of the noise signal which
affects the model and on the characteristics of the function which weighs the noise signal itself.

The “modified” LMS algorithm always converges to a point, while other learning procedures such as the ROLS
algorithm can diverge. These features make the “modified” LMS mechanism more natural and plausible for
modelling agents’ beliefs in SRLS models.
Abstract.

A Gain-Scheduling Neural Network Architecture is proposed to enhance the noise-filtering efficiency of
feedforward neural networks, in terms of both nominal performance and robustness. The synergistic benefits
of the proposed architecture are demonstrated and discussed in the context of the noise-filtering of
signals that are typically encountered in aerospace control systems. The synthesis of such a
gain-scheduled neurofilter provides the robustness of linear filtering, while preserving the nominal
performance advantage of conventional non-linear neurofiltering. Quantitative performance and robustness
evaluations are provided for the signal processing of pitch rate responses to typical pilot command inputs
for a modern fighter aircraft model.

1.  Introduction. 

The capability of feedforward neural networks to serve as noise-filters for complex systems with varying
characteristics and/or changing modes of operation was recently analysed for the noise-filtering of
signals that are typically encountered in aerospace control and diagnostic systems [1]. For such systems,
the nominal dynamics of the signals are a simplified version of the actual dynamics, due to modelling
approximations, system uncertainties, and/or changing modes of operation. As a result, the desired
neurofilter should not only provide satisfactory signal processing over the nominal dynamic range of the
signals, but should also be robust and maintain its performance in the presence of changes in the nominal
dynamics of the signals. From that perspective, linear and non-linear feedforward neural networks were
trained to filter noise by learning to map sequences of noisy input data onto the exact values of the most
recently sampled data [1]. Comparative performance/robustness evaluations indicated that the synthesised
non-linear neurofilter performed better than the linear neurofilter within the nominal dynamic range of
the signals, whereas the linear neurofilter was more robust in the presence of substantial variations in
the parameters of the signal generating process. This result pointed to the need for a more global neural
architecture with the potential to synergistically combine the complementary benefits of linear
neurofiltering and conventional non-linear neurofiltering.

To address that issue, a gain-scheduling neural network (GSNN) architecture is proposed to find the
optimal combination of linear and non-linear neurofiltering that provides the best signal estimates from
input sequences of noisy data. The system functionality of the gain-scheduled neurofilter is briefly
introduced in Section 2, while Section 3 describes the gain-scheduling training architecture itself. In
Section 4, the nominal performance and robustness of the gain-scheduled neural network are compared to
those of the linear and non-linear neurofilters separately, while Section 5 discusses possible extensions
towards performance/robustness enhancement, non-linear adaptive neurofiltering, and neurosmoothing.

2.  System  Functionality  of  the  Neurofilter. 

The system functionality of the neurofilter is illustrated in Fig.1 in the context of an aerospace control
system application. The signals to be filtered are the simulated pitch-rate responses to both pitch rate
and velocity commands. The closed-loop system includes a non-linear neurocontroller designed in Refs.[2-3]
to provide independent control of pitch-rate/airspeed for a state-space representation of a modern fighter
aircraft [4]. The plant model consists of an integrated airframe/propulsion linear model, a fuel flow
actuator modelled as a linear second order system with position and rate limits, and a thrust vectoring
actuator modelled as a linear first order system with position and rate limits. As a result, the signal
generating process represented by the closed-loop control system of Fig.1 contains nonlinearities due to
the actuator position/rate limits and the nonlinear structure of the neurocontroller. For the purpose of
this study, the noise source has been placed outside of the control loop so that a clean baseline signal
would be available for comparison. The purpose of the trained neurofilter is to provide an estimate of the
actual data values that have been corrupted by noise, in order to enhance any subsequent processing by
out-of-the-loop peripheral modules such as failure-detectors and failure-identifiers (e.g. Ref.[5]),
off-line/on-line system-identifiers (e.g. Ref.[6]), damage estimators (e.g. Ref.[7]), etc.

* Sverdrup Technology, Inc., 3001 Aerospace Parkway, Brook Park, Ohio, 44142.

In this simulation, the information needed to synthesise the neurofilter is provided by closed-loop pitch
rate responses to input commands $z_{SEL}(t) = (q_{SEL}(t), v_{SEL}(t))$, where $q_{SEL}(t)$ is the pitch rate
command input, and $v_{SEL}(t)$ is the velocity command input. The pitch rate command input $q_{SEL}(t)$ is a
doublet randomly centered at a time $t_c$ between 2.5s and 5s such that $q_{SEL}(t < t_c) = Q_0$,
$q_{SEL}(t_c < t < 2t_c) = -Q_0$, and $q_{SEL}(2t_c < t) = 0$, as indicated in Fig.2a. The concurrent velocity
command input is the step function $v_{SEL}(t < 0) = 0$ and $v_{SEL}(0 < t) = V_0$, as indicated in Fig.2b.
These commanded inputs $q_{SEL}(t)$ and $v_{SEL}(t)$, which represent the frequency-content of typical pilot
command inputs, were subsequently filtered through a prefilter-for-command-shaping (Fig.1) in order to
generate the commanded trajectories $z_c(t) = (q_c(t), v_c(t))$ that are to be tracked by the closed-loop
control system. The commanded pitch rate response $q_c(t)$ and the commanded velocity response $v_c(t)$
corresponding to a doublet pitch rate command input $q_{SEL}(t)$ and a step velocity command input
$v_{SEL}(t)$ are represented in the diagrams of Fig.2. The maximum intensities $|Q_0|$ and $|V_0|$ of the
randomly selected input commands were bounded by $Q_{max} = 5$ deg/sec (corresponding to 0.5 inches of
pilot stick deflection), and $V_{max} = 20$ ft/s. The pitch rate responses to such randomly generated pilot
command inputs were sampled every $\Delta = 10$ ms over $T = 14$ s, and they were corrupted with additive
gaussian white noise with a standard deviation $\sigma_{training} = 0.3$ deg/sec before being passed to the
training architecture of the neurofilter.
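
The command inputs described above can be sketched in a few lines of Python. This is only an illustration of the doublet, the step, and the sampling and noise settings quoted in the text; the function names, the random seeds and the bound `q_max` are assumptions introduced here, and the closed-loop vehicle model that produces the actual pitch rate responses is not reproduced.

```python
import random

def pitch_rate_doublet(t, q0, tc):
    """Doublet command: +q0 before tc, -q0 between tc and 2*tc, zero afterwards."""
    if t < tc:
        return q0
    if t < 2.0 * tc:
        return -q0
    return 0.0

def velocity_step(t, v0):
    """Step command: zero for t <= 0, v0 afterwards."""
    return v0 if t > 0.0 else 0.0

def sample_times(dt=0.01, duration=14.0):
    """Sampling instants k*dt covering the duration (10 ms over 14 s by default)."""
    return [k * dt for k in range(int(round(duration / dt)))]

def add_white_noise(samples, sigma=0.3, seed=0):
    """Add zero-mean gaussian noise of standard deviation sigma to each sample."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, sigma) for s in samples]

# Hypothetical draw of one random command (q_max and v_max are assumed bounds).
rng = random.Random(1)
q_max, v_max = 5.0, 20.0
tc = rng.uniform(2.5, 5.0)
q0 = rng.uniform(-q_max, q_max)
v0 = rng.uniform(-v_max, v_max)
times = sample_times()
q_command = [pitch_rate_doublet(t, q0, tc) for t in times]
v_command = [velocity_step(t, v0) for t in times]
```

In the simulation the noise is added to the sampled closed-loop response rather than to the command itself, so `add_white_noise` would be applied to whatever response sequence the vehicle model returns.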

3. Gain-Scheduling Training Architecture.

The proposed neurofilter consists of a linear neural network and a non-linear neural network with
optimised internal configurations, and whose outputs are modulated by a gain-scheduling feedforward neural
network. The optimised linear neural network and the optimised non-linear neural network used in this
simulation were trained in Ref.[1] with the training architecture shown in Fig.3. During training, the
inputs of these two neurofilters consisted of sequences of the fifty most recently sampled noisy data, and the
target values were the exact values of the last sampled data. In Fig.3, the notation $F(p, h, 1)$ represents
a feedforward neural network with p input units, a single hidden layer of h sigmoidal neurons, and a single
linear output neuron. Both linear and non-linear neurofilters were trained to minimise the error
$(\hat{q} - q)^2(t)$ between the filter output $\hat{q}(t)$ and the exact value $q(t)$ of the pitch rate signal
generated as in Section 2. The optimised network configurations of these two types of neurofilters were
$F(50, 30, 1)$ for the non-linear neurofiltering (i.e. 50 inputs, 30 hidden sigmoidal neurons, and 1 linear
output neuron), and $F(50, 1)$ for the linear neurofiltering (i.e. 50 inputs, and 1 linear output neuron).
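
The way training pairs are formed from the sampled signals can be sketched as follows. This is a minimal illustration, assuming the noisy and exact pitch rate sequences are available as plain Python lists; the function name `make_training_pairs` is hypothetical and no network training is shown.

```python
def make_training_pairs(noisy, exact, p=50):
    """Pair each window of the p most recent noisy samples with the exact
    value of the last sample in that window."""
    if len(noisy) != len(exact):
        raise ValueError("noisy and exact sequences must have the same length")
    pairs = []
    for k in range(p - 1, len(noisy)):
        window = noisy[k - p + 1 : k + 1]   # p samples ending at index k
        pairs.append((window, exact[k]))     # target: exact value at index k
    return pairs
```

With the 1400 samples of one 14 s response and p = 50, this yields 1351 input/target pairs.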

As shown in Fig.4, the "fusion" of the optimised linear and non-linear neurofilters is achieved by training
a gain-scheduling neural network to minimise the error $(\hat{q}_{GSNN} - q)^2(t)$ between the Gain-Scheduled
Neural Network output $\hat{q}_{GSNN}(t)$ and the exact value $q(t)$ of the pitch rate signal generated as in
Section 2. As indicated in Fig.4, the gain-scheduled neurofilter estimate $\hat{q}_{GSNN}(t)$ is an adaptive
combination of the non-linear neurofilter estimate $\hat{q}(t)_{\text{non-linear}}$ and the linear neurofilter
estimate $\hat{q}(t)_{\text{linear}}$:

$\hat{q}_{GSNN}(t) = g(t) \times \hat{q}(t)_{\text{non-linear}} + (1 - g(t)) \times \hat{q}(t)_{\text{linear}}$ , (1)

where the gain $g(t)$ is the output of the non-linear gain-scheduling neural network. The role of the
gain-scheduling neural network is therefore to find the optimal combination of linear and non-linear
neurofiltering that extracts the best signal estimates from input sequences of noise-corrupted data. In order
to facilitate this "classification", the inputs of the gain-scheduling neural network were chosen to be filter
estimates of the exact signal values instead of the original noisy data values. These filter estimates were
furthermore chosen to be the computed outputs of the linear neurofilter, in light of the robustness advantage
that linear filtering has over conventional non-linear neurofiltering. The configuration of the
gain-scheduling neural network chosen in this application consisted of twenty five input units, ten hidden
sigmoidal neurons, and a linear output neuron with the thresholding activation function $y(x)$:

$y(x < 0) = 0$; $y(0 \le x \le 1) = x$; $y(1 < x) = 1$ , (2)

and  training  was  performed  with  the  backpropagation  algorithm  [8-9]. 
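
Equations (1) and (2) amount to a clipped gain that blends the two filter outputs. The short sketch below illustrates only that arithmetic; the function names are assumptions made for this example, and the networks that produce the gain and the two estimates are not modelled.

```python
def threshold_activation(x):
    """Eq. (2): 0 below 0, identity on [0, 1], 1 above 1."""
    return min(1.0, max(0.0, x))

def gain_scheduled_estimate(g, q_nonlinear, q_linear):
    """Eq. (1): weight g on the non-linear estimate, 1 - g on the linear one."""
    return g * q_nonlinear + (1.0 - g) * q_linear

# Example: a raw scheduler output of 0.2 gives 20 % non-linear and 80 % linear.
g = threshold_activation(0.2)
blended = gain_scheduled_estimate(g, q_nonlinear=1.0, q_linear=2.0)
```

Because the gain is confined to [0, 1], the combined estimate always lies between the two individual estimates.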
4. Comparative Nominal Performance and Robustness Evaluations.

The ability of the linear, non-linear, and gain-scheduled neurofilters to remove the noise from the pitch
rate response to a given pilot commanded input "c" is measured by the ratio $R_c$


T being the duration of the pilot command input, and $\Delta$ the sampling time of the vehicle outputs. In
Eq.(3), $q(t_k)$ is the exact pitch rate response, $n(t_k)$ is the white noise fluctuation added to $q(t_k)$,
and $\hat{q}(t_k)$ is the filter output corresponding to an input sequence of p sampled noisy data, i.e.
$\{q(t_{k-i}) + n(t_{k-i}),\ \min(k, p) > i \ge 0\}$.

To compare the performances of the aforementioned neurofilters, two measures "R" and "r" based on
Eq.(3) are introduced [1]. The R-measure is a statistical average of $R_c$ calculated over the whole dynamic
range of pilot command inputs as characterised in Section 2 by $(Q_0, V_0, t_c)$, where $Q_0$, $V_0$, and $t_c$
are uniformly distributed over $[-Q_{max}, Q_{max}]$, $[-V_{max}, V_{max}]$ and [2.5s, 5s] respectively. The
r-measure is the value of $R_c$ for a most demanding case of pilot command input corresponding to the pitch
rate doublet $q_{SEL}(t < 5\,sec) = Q_{max}$, $q_{SEL}(5\,sec < t < 10\,sec) = -Q_{max}$,
$q_{SEL}(10\,sec < t) = 0$; and the velocity step $v_{SEL}(t < 0) = 0$ and $v_{SEL}(0 < t) = V_{max}$. The
R-measure grades the average efficiency of a neurofilter in removing the noise over an exhaustive set of pilot
command inputs, whereas the r-measure estimates the filtering efficiency for one of the worst cases of pilot
command inputs. To test the ability of the neurofilters to operate at noise levels other than that used in
training, the R- and r-measures were evaluated with gaussian white noise of various standard deviations
ranging from $\sigma_{min} = 0$ to $\sigma_{max} = 1$ deg/sec. The values of the R- and r-measures corresponding
to the nominal dynamic range of the signals are plotted in Figs.5a & 6a respectively. The results show that
the gain-scheduled neurofilter outperforms both the optimised linear filter and the optimised non-linear
neurofilter, not only at the noise level used in training, but also at all noise levels between
$\sigma_{min} = 0$ and $\sigma_{max} = 1$ deg/sec.

To further compare the robustness of the gain-scheduled neurofilter with the robustness of the optimised
linear neurofilter and non-linear neurofilter respectively, the R- and r-measures were also evaluated on a
test set extending beyond the nominal dynamic range of the signals (used for training) and generated as
follows. The matrix elements of the A, B, and C matrices of the vehicle model [4] were randomly varied
within ±50% of their nominal values, with the sole requirement that the stability of the closed-loop system
be preserved [2]. Due to the severity of the deviations of the A, B, C matrices from their nominal values,
the closed-loop system responses to typical pilot command inputs presented significant deviations from the
nominal responses. The statistical evaluations of "R" and "r" are plotted in Figs.5b & 6b respectively for a
typical set of A, B, and C leading to large variations of the vehicle model. The results show that the
gain-scheduled neurofilter still outperforms the optimised linear filter and the optimised non-linear
neurofilter at all noise levels. This is graphically illustrated in Fig.7 by the filtering of the pitch rate
response to the most demanding pilot command input of the vehicle model with the same set of off-nominal A,
B, and C matrices as that used for the evaluations of the R- and r-measures plotted in Figs.5b & 6b
respectively. As shown by the plots of Figs.7a, 7b & 7c, additive gaussian white noise is more efficiently
removed from the noisy closed-loop signals by the gain-scheduled neurofilter (7c) than by the optimised
linear neurofilter (7a) or the optimised non-linear neurofilter (7b) separately. The synergistic benefits of
the newly proposed gain-scheduling architecture are even more apparent when comparing Figs.7a, 7b & 7c in
light of the plot of the gain-scheduling neural network output (identical to the output gain of the
non-linear neurofilter) shown in Fig.7d. This comparison indicates that the gain-scheduled neurofiltering
presents the characteristics of linear neurofiltering around 1 sec and 6 sec, i.e. when the pitch rate
estimates of the linear neurofilter are better than those of the non-linear neurofilter. More specifically,
Fig.7d also indicates that, around 1 sec, the gain-scheduled neurofilter estimate consists of about 80 % of
linear neurofilter estimate, and about 20 % of non-linear neurofilter estimate. Around 6 sec, the
gain-scheduled neurofilter estimate is 100 % of the linear neurofilter estimate. Otherwise, the
gain-scheduled neurofilter estimate is for the most part given by the non-linear neurofilter estimate, e.g.
above 12 sec where it is 100 % of the non-linear neurofilter estimate.

5. Conclusion.

A Gain-Scheduling Neural Network Architecture has been proposed to enhance the robustness of
feedforward neurofilters, and was analysed in the context of the noise-filtering of pitch rate responses to
pilot command inputs for a modern fighter aircraft model. The proposed architecture consists of an optimised
linear feedforward neurofilter, an optimised non-linear feedforward neurofilter, and a gain-scheduling
feedforward neural network which is trained with backpropagation to synergistically combine the
complementary benefits of the linear and non-linear neurofilters. The resulting gain-scheduled neurofilter
consistently performed better than each neurofilter separately, within the nominal as well as off-nominal
dynamic range of the simulated signals.

Future areas of research would include possible extensions of the functionality and scope of the
proposed gain-scheduling neural network architecture. Of particular interest would be the possibility of
further enhancing neurofiltering through the gain-scheduling of a collection of linear filters that would
have been separately optimised on the disjoint elements of a partition of the space of the input signals. The
synthesis of the multi-output gain-scheduler(s) required for the fusion of such optimised linear neurofilters
could benefit from the robustness of genetic algorithms or even fuzzy rule-based scheduling, or from training
algorithms like those developed for the hierarchical mixing of expert neural networks [10].

Of additional interest would be the possibility to extend the proposed architecture to achieve non-linear
adaptive neurofiltering through the synergy of supervised and unsupervised training schemes, and by taking
advantage of the on-line learning capabilities of neural networks. An important practical issue to be addressed
in that regard would be whether neural networks can be trained in unsupervised training modes to efficiently
gain-schedule the supervised training of a partition of individual neurofilters of the type proposed in Ref.[11].

Of further interest would be the possibility to extend the proposed architecture to the smoothing of noisy
signals by training a neural network to gain-schedule optimised linear and non-linear neurosmoothers that
would have been previously trained to map sequences of p successively sampled noisy data onto the exact
values of any of the previous (p - 1) samples input to the network. Such gain-scheduled neurosmoothers would
be expected to provide better signal estimates than their neurofilter counterparts in view of the additional
information provided [11-12], yet at the expense of the time corresponding to the delay needed for the signals
to be available. How to reach the best compromise between "accuracy" and "time" would therefore depend
upon the computational requirements and characteristics of the specific post-processing to be performed on
the signals.

Finally, future comparative analysis with other traditional techniques, such as Extended Kalman Filtering
[13], could also provide insight on how to improve the performance and broaden the applicability of the
proposed Gain-Scheduling Neural Network approach.

Figure 1. Functional system diagram of the trained neurofilter.

(a) Doublet pitch-rate centered at time $t = t_c$. (b) Step velocity.

Figure 2. Pilot command input, $z_{SEL}(t) = (q_{SEL}(t), v_{SEL}(t))$, and commanded trajectory $z_c(t) = (q_c(t), v_c(t))$.

Figure 3. Training architecture of asymmetric neurofilters $F(p, h, 1)$ with p input units, one linear output, and a
single hidden layer of h sigmoidal neurons.

Figure 4. Training architecture of the gain-scheduling neural network.

Abstract

A multiresolution learning method for back-propagation networks is proposed in this paper.
With this learning method, a series of back-propagation networks is built to learn the same set of
input vectors under different resolutions. After a network has been trained on a particular resolution
of input vectors, its connection weights are transformed and passed to the next network, which is
responsible for learning a higher resolution of the input vectors. The objective is to improve the
convergence rate of the networks. Experimental results are used to demonstrate the ability of this approach.


1  Introduction 

The back-propagation network has been studied for many years and many researchers have applied it
to a wide variety of problems successfully [9]. Unfortunately, it has been shown that the back-propagation
algorithm, which adopts the steepest descent technique, is slow to converge in a multilayer network
[3, 4]. Such a limitation prohibits the use of back-propagation networks on large scale problems, e.g.
problems with a high-dimensional input space. The multilayer perceptron assumes that each individual
input neuron acts independently of the other neurons. In fact, some problems, for example image
recognition problems, use the grey levels of images as the input to the network, and the input neurons do
have some correlations with their neighboring neurons. However, a multilayer perceptron does not take this
into account.

On the other hand, the human visual system, as an optimal image processor, can process a huge
amount of information quickly. Studies of this system have shown that the retina of the human eye
is a structured array that sees a wide angle in a low-resolution way using peripheral vision, while
simultaneously allowing high-resolution, detailed perception by the fovea in a small central portion of
the viewing region [5]. This finding triggered significant interest in multiresolution signal decomposition
and some researchers [1, 8] have applied this multiresolution technique in many fields of application,
e.g. edge detection, data compression, surface interpolation, and shape analysis. Recently, several
researchers have incorporated this technique into neural networks [10, 12].

A multiresolution representation of a signal provides a simple hierarchical framework for interpreting
the information. In some sense, the signal at a coarse resolution provides the "context" of the signal.
It is natural to analyze first the signal at a coarse resolution and then gradually increase the resolution.
It is believed that such a coarse-to-fine strategy provides a possibility for reducing the computational cost
of signal operations [8].

In this paper, we propose a problem-independent learning method, which adopts the multiresolution
signal decomposition technique, for back-propagation networks in order to alleviate the shortcomings
of this kind of network described above. With this multiresolution learning method, a series of
back-propagation networks is trained on a set of training data under different resolutions, and we believe
that the convergence rate of back-propagation networks can be improved, i.e. convergence is
faster than with the original network.


2  Multiresolution Approximation of $L^2(R)$

In this section, we review the basic concept of multiresolution analysis introduced by Mallat [6, 7].
Suppose that the original signal $f(x)$ described in this paper is measurable and has a finite energy:
$f(x) \in L^2(R)$. According to Mallat's definition, we can define the multiresolution approximation of
$L^2(R)$.

Definition  The approximation of a signal $f(x)$ at a resolution r can be defined as an estimate of
$f(x)$ derived from r measurements per unit length. These measurements are computed by uniformly
sampling at a rate r the function $f(x)$ smoothed by a low-pass filter whose bandwidth is
proportional to r.

In an approximation operation, when the details of $f(x)$ finer than the resolution r are removed, the
highest frequencies of this function are suppressed. In the following, we discuss only the approximation of a
function on a dyadic sequence of resolutions $(2^j)_{j \in Z}$.

The approximation of the signal $f(x)$ at the resolution $2^j$, $A_{2^j} f(x)$, is characterized by the set of
inner products,

$A^d_{2^j} f = \left( \langle f(u), \phi_{2^j}(u - 2^{-j} n) \rangle \right)_{n \in Z}$ , (1)

where $\phi_{2^j}(x) = 2^j \phi(2^j x)$ and $\phi(x) \in L^2(R)$ is a unique function called a scaling function.
$A^d_{2^j} f$ is called a discrete approximation of $f(x)$ at the resolution $2^j$. In practice, a physical
measuring device can only measure a signal at a finite resolution. For normalization purposes, it is supposed
that this resolution is equal to 1, and we let $A^d_1 f$ be the discrete approximation at the resolution 1 that
is measured.

Let $H$ be a discrete filter whose impulse response is $h(n) = \langle \phi_{2^{-1}}(u), \phi(u - n) \rangle$
and let $\tilde{H}$ be the mirror filter with impulse response,

$\tilde{h}(n) = h(-n)$ . (2)

Then, it can be shown that the discrete approximation of $f(x)$ at a resolution $2^j$, $A^d_{2^j} f$, can be
calculated by filtering $A^d_{2^{j+1}} f = (\langle f(u), \phi_{2^{j+1}}(u - 2^{-j-1} n) \rangle)_{n \in Z}$
with the discrete filter $\tilde{H}$ and keeping every other sample of the convolution product,

$A^d_{2^j} f = \left( \sum_{k=-\infty}^{\infty} \tilde{h}(2n - k) \langle f(u), \phi_{2^{j+1}}(u - 2^{-j-1} k) \rangle \right)_{n \in Z}$ . (3)

All the discrete approximations $A^d_{2^j} f$, for $j < 0$, can thus be computed from $A^d_1 f$ by repeating
this process. This operation is called a pyramid transform and the set of discrete approximations
$(A^d_{2^j} f)$ was called a Gaussian pyramid by Burt and Adelson [1].

3  The  Multiresolution  Learning  Method 

The multiresolution learning method we propose involves a series of back-propagation networks. Each
network is responsible for learning the same set of input vectors, but under a different resolution. The
training processes carried out by the set of back-propagation networks proceed from the
coarsest resolution to the finest resolution. After a network has been trained on a particular resolution
of input vectors, it transfers the learned information, that is the connection weights, to the upper level
network, which is required to learn a higher resolution of the input vectors.

3.1  Input  Vector  Representation 

Firstly, let us define how input vectors are represented under different resolutions. Let $\{x_i\}$ be a set
of N-dimensional input vectors where $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$ and $x_{ij} \in \Re$. It has been
mentioned in Section 2 that $A^d_1 f$ is the discrete approximation at the resolution 1 and contains a finite
number of samples. Then, we can define the discrete approximation of $x_i$ at the resolution 1, $A^d_1 x_i$, as,

$A^d_1 x_i = (x_{in})_{1 \le n \le N}$ . (4)

A set of N-dimensional input vectors $\{A^d_1 x_i\}$ can thus be formed.

By choosing a suitable discrete filter $\tilde{H}$ and applying Equation 3, the discrete approximation of
$x_i$ at the resolution $2^j$, $A^d_{2^j} x_i$, can also be defined as,

$A^d_{2^j} x_i = \left( \sum_{k=-\infty}^{\infty} \tilde{h}(2n - k)\, A^d_{2^{j+1}} x_i(k) \right)_{1 \le n \le 2^j N}$ , (5)

where $A^d_{2^{j+1}} x_i(k)$ is the k-th element of $A^d_{2^{j+1}} x_i$. Hence, all the discrete approximations
$A^d_{2^j} x_i$, for $j < 0$, can be computed from $x_i$ by repeating this process.

In order to avoid border problems when computing the discrete approximations $A^d_{2^j} x_i$, it is supposed
that the original input vector $A^d_1 x_i$ is symmetric with respect to $n = 1$ and $n = N$, i.e.,

$A^d_1 x_i(n) = \begin{cases} A^d_1 x_i(-n + 2) & \text{if } -N + 2 \le n < 1 \\ A^d_1 x_i(2N - n) & \text{if } N < n \le 2N - 1 \\ 0 & \text{if } n < -N + 2 \text{ or } n > 2N - 1 \end{cases}$ (6)

If the chosen discrete filter $H$ is even, i.e., $\tilde{H} = H$, each discrete approximation $A^d_{2^j} x_i$
will also be symmetric with respect to $n = 1$ and $n = 2^j N$.
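
One step of the approximation in Equation 5, using the symmetric border extension of Equation 6, can be sketched as below. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the filter is passed as a dictionary mapping each index m to the coefficient of the mirror filter at m, and the two-tap averaging filter in the example is an assumption chosen only because its result is easy to check by hand.

```python
def symmetric_extension(x, n):
    """Value at 1-based index n of the vector x extended as in Equation 6."""
    size = len(x)
    if 1 <= n <= size:
        return x[n - 1]
    if -size + 2 <= n < 1:
        return x[(-n + 2) - 1]          # mirror about n = 1
    if size < n <= 2 * size - 1:
        return x[(2 * size - n) - 1]    # mirror about n = size
    return 0.0

def approximate(x, h_tilde):
    """One step of Equation 5: filter with h_tilde and keep every other sample.

    h_tilde maps an integer m to the mirror-filter coefficient at m; the
    term h_tilde(2n - k) * x(k) is visited through m = 2n - k."""
    half = len(x) // 2
    return [
        sum(c * symmetric_extension(x, 2 * n - m) for m, c in h_tilde.items())
        for n in range(1, half + 1)
    ]

# Hypothetical two-tap averaging filter (coefficients sum to 1).
averaging = {0: 0.5, 1: 0.5}
coarse = approximate([1.0, 2.0, 3.0, 4.0], averaging)   # [1.5, 3.5]
```

Applying `approximate` repeatedly halves the vector length each time, which is how the coarser input sets for the lower level networks would be obtained from the original vectors.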


3.2  Back-Propagation  Network  Architecture 

After several sets of the input vectors under different resolutions, $\{A^d_{2^j} x_i\}$, have been generated,
we build a group of back-propagation networks, and each back-propagation network $B_{2^j}$ is responsible
for learning a set of discrete approximations $\{A^d_{2^j} x_i\}$. The size of the input layer for each
network will be the same as the dimension of the vectors of this particular resolution, that is $2^j N$, while
the size of the output layer represents the number of categories to be classified in the input vectors and
is the same for all networks generated.

The  required  number  of  neurons  in  the  hidden  layer  greatly  depends  on  the  nature  of  the  problem 
to  be  solved  [3].  With  some  specific  knowledge  about  the  structure  of  the  problem,  and  a  fundamental 
understanding  of  how  the  back-propagation  networks  might  go  about  implementing  this  structure,  one 
can  sometimes  form  a  good  estimate  of  the  proper  network  size.  Like  the  size  of  the  output  layer,  the 
size  of  the  hidden  layer  is  the  same  among  all  back-propagation  networks  created. 


3.3  Training  Procedure  Strategy 

With some sets of vectors under different resolutions and a series of corresponding back-propagation
networks, we can start the training procedure. The lowest level network $B_{2^{-J}}$ (the network
with the coarsest resolution of vectors as input) is trained first. We initialize the connection weights of
this network with small random numbers [9] and start the training process. Traditionally, the training
process of a back-propagation network is repeated until a minimum of the sum squared error (SSE) or
a point sufficiently close to the minimum is found. However, such a minimum may not be found in
the networks we defined, except for the highest level one $B_1$ (the one with the original input vectors
$x_i$ as input), because some information of the original input vectors is lost during the approximation
process.

As a result, we define an intermediate stopping criterion for terminating the training processes of the
back-propagation networks $(B_{2^j})_{-J \le j \le -1}$, which are trained on the discrete signals
$(\{A^d_{2^j} x_i\})_{-J \le j \le -1}$.

Let $M$ be the number of hidden neurons, $2^j N$ be the number of input neurons, and $w_{pq}$ be the connection
weight between hidden neuron q and input neuron p, which will be updated with $\Delta w_{pq}$ in the current
training cycle. Hence, a term $W(t)$ is defined as,

$W(t) = (2^j N M)^{-1} \sum_{q=1}^{M} \sum_{p=1}^{2^j N} \left| \frac{\Delta w_{pq}}{w_{pq}} \right| + \rho W(t-1)$ , (7)

where $0 < \rho < 1$ is called a history factor, and $(2^j N M)^{-1}$ is used for normalization purposes. The
intermediate stopping criterion is then defined as,

$W(t) < \delta$ , (8)

where $\delta > 0$.
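
The bookkeeping behind Equations 7 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions: weights and their updates are stored as nested lists indexed `[q][p]`, no weight is exactly zero, and the names `update_w` and `should_stop` are hypothetical.

```python
def update_w(previous_w, weights, deltas, rho):
    """Equation 7: mean relative weight change plus rho times the previous W.

    weights[q][p] is the weight between hidden neuron q and input neuron p,
    deltas[q][p] is its update in the current training cycle."""
    hidden = len(weights)
    inputs = len(weights[0])
    total = 0.0
    for q in range(hidden):
        for p in range(inputs):
            total += abs(deltas[q][p] / weights[q][p])   # assumes non-zero weights
    return total / (inputs * hidden) + rho * previous_w

def should_stop(w_t, delta):
    """Equation 8: stop the intermediate training once W(t) falls below delta."""
    return w_t < delta

# Example: every weight changes by 10 %, previous W = 0.5, history factor 0.8.
weights = [[1.0, 2.0], [4.0, 5.0]]
deltas = [[0.1, 0.2], [0.4, 0.5]]
w_now = update_w(0.5, weights, deltas, rho=0.8)   # 0.1 + 0.8 * 0.5 = 0.5
```

A larger history factor makes W(t) decay more slowly, so the lower level network keeps training for longer before the criterion is met.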
3.4  Connection Weight Transformation

After the intermediate stopping criterion is satisfied in one network $B_{2^j}$, we transfer its connection
weights to the next network $B_{2^{j+1}}$ with higher resolution. From the lower level network $B_{2^j}$, we
have two sets of connection weights, $\{w_{pq}\}$ and $\{v_{qr}\}$, where $w_{pq}$ is a connection weight from
input neuron p to hidden neuron q and $v_{qr}$ is a connection weight from hidden neuron q to output neuron r.

Since the sizes of the hidden layer and the output layer are the same on both networks, we can
simply assign $\{v_{qr}\}$ to the higher level network as the connection weights between hidden layer and
output layer.

However, the sizes of their input layers are different and we need to do some transformations on
$\{w_{pq}\}$. If a discrete signal $A^d_{2^j} x_i$ is passed to the lower level network, the hidden neuron q
will receive,

$\sum_{p=1}^{2^j N} w_{pq}\, A^d_{2^j} x_i(p)$ (9)

as its input. In order to maintain the same status after transformation, the following condition must
hold for each hidden neuron,

$\sum_{o=1}^{2^{j+1} N} w'_{oq}\, A^d_{2^{j+1}} x_i(o) = \sum_{p=1}^{2^j N} w_{pq}\, A^d_{2^j} x_i(p)$ , (10)

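where $w'_{oq}$ is the connection weight of the higher level network from input neuron o to hidden neuron
q. We can work out from Equations 5 and 10 that the connection weights are equal to,

$w'_{oq} = \begin{cases} \sum_{p=1}^{2^j N} \tilde{h}(2p - 1)\, w_{pq} & \text{if } o = 1 \\ \sum_{p=1}^{2^j N} \left( \tilde{h}(2p - o) + \tilde{h}(2p + o - 2^{j+2} N) + \tilde{h}(2p + o - 2) \right) w_{pq} & \text{if } 2 \le o \le 2^{j+1} N - 1 \\ \sum_{p=1}^{2^j N} \tilde{h}(2p - 2^{j+1} N)\, w_{pq} & \text{if } o = 2^{j+1} N \end{cases}$ (11)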
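
The weight transformation of Equation 11 can be checked numerically against the condition of Equation 10. The sketch below is an illustration under stated assumptions, not the authors' implementation: it handles the weights of a single hidden neuron, uses the mirror of the four coefficients given later in Equation 12, and all function names are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
# Mirror filter h_tilde(n) = h(-n) for the four coefficients of Equation 12.
H_TILDE = {2: (1 + SQRT3) / 8, 1: (3 + SQRT3) / 8,
           0: (3 - SQRT3) / 8, -1: (1 - SQRT3) / 8}

def h_tilde(n):
    return H_TILDE.get(n, 0.0)

def extended(x, n):
    """Symmetric border extension of Equation 6 (1-based index n)."""
    size = len(x)
    if 1 <= n <= size:
        return x[n - 1]
    if -size + 2 <= n < 1:
        return x[-n + 1]
    if size < n <= 2 * size - 1:
        return x[2 * size - n - 1]
    return 0.0

def approximate(x):
    """Equation 5 restricted to the support of h_tilde (k = 2n-2 .. 2n+1)."""
    return [sum(h_tilde(2 * n - k) * extended(x, k)
                for k in range(2 * n - 2, 2 * n + 2))
            for n in range(1, len(x) // 2 + 1)]

def transform_weights(w):
    """Equation 11 for one hidden neuron: w has 2^j N entries, result has twice as many."""
    low, high = len(w), 2 * len(w)
    new_w = []
    for o in range(1, high + 1):
        total = 0.0
        for p in range(1, low + 1):
            if o == 1:
                coeff = h_tilde(2 * p - 1)
            elif o == high:
                coeff = h_tilde(2 * p - high)
            else:
                coeff = (h_tilde(2 * p - o) + h_tilde(2 * p + o - 2 * high)
                         + h_tilde(2 * p + o - 2))
            total += coeff * w[p - 1]
        new_w.append(total)
    return new_w

# Both sides of Equation 10 for an arbitrary fine-resolution vector.
x_fine = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.0, 0.9]
w_low = [0.5, -0.25, 1.5, 0.75]
lhs = sum(wo * xo for wo, xo in zip(transform_weights(w_low), x_fine))
rhs = sum(wp * xp for wp, xp in zip(w_low, approximate(x_fine)))
```

For this filter the two sums agree to rounding error, so a hidden neuron in the higher level network starts from the same net input it had in the lower level network.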


4  Experimental  Results 


In this section, we show some computational results to illustrate the performance of the proposed
learning method. A numeric recognition problem is used as an example. For this problem, 10 binary
patterns of numbers, from 0 to 9, were selected as training examples and their size was 32 x 32.

Two experiments were carried out with two different sets of initial connection weights for the
back-propagation networks. In each experiment, we selected 3 history factors $\rho$ and 2 intermediate stopping
criteria $\delta$, i.e., $\rho = 0.8$, $\rho = 0.5$, $\rho = 0.2$, $\delta = 0.0005$, and $\delta = 0.001$, for
the training processes of the proposed learning method. To demonstrate our method, two sets of network
structure were used, $(B_{2^j})_{-2 \le j \le 0}$ and $(B_{2^j})_{-1 \le j \le 0}$, which were called the 3-level
network set and the 2-level network set respectively. Also, a 1-level network $B_1$ was built to compare its
performance with the 3-level and the 2-level network sets. The training processes of all networks were
repeated until the SSE was smaller than 0.01, that is $\epsilon = 0.01$. Hence, for each experiment, there was
a total of 13 training jobs to be carried out.

As shown in Equations 5 and 11, we must first define the impulse response $\tilde{h}(n)$ before the input
vectors can be represented under different resolutions and the connection weight transformation can be
carried out. In other words, $h(n)$ must be defined, since $\tilde{h}(n) = h(-n)$ from Equation 2.

There are many ways of choosing these coefficients $h(n)$ [11], as long as $\sum_{n=-\infty}^{\infty} h(n) = 1$.
Here, we adopted the suggestion from Daubechies [2],

$h(n) = \begin{cases} \frac{1 + \sqrt{3}}{8} & \text{if } n = -2 \\ \frac{3 + \sqrt{3}}{8} & \text{if } n = -1 \\ \frac{3 - \sqrt{3}}{8} & \text{if } n = 0 \\ \frac{1 - \sqrt{3}}{8} & \text{if } n = 1 \\ 0 & \text{otherwise} \end{cases}$ (12)

With the use of the coefficients shown in Equation 12, Equations 5 and 11 can then be simplified
into,


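$A^d_{2^j} x_i = \left( \sum_{k=2n-2}^{2n+1} \tilde{h}(2n - k)\, A^d_{2^{j+1}} x_i(k) \right)_{1 \le n \le 2^j N}$ (13)

and,

$w'_{oq} = \begin{cases} \tilde{h}(1)\, w_{1,q} & \text{if } o = 1 \\ \tilde{h}(0)\, w_{1,q} + \tilde{h}(2)\, w_{1,q} + \tilde{h}(2)\, w_{2,q} & \text{if } o = 2 \\ \tilde{h}(-1)\, w_{(o-1)/2,\,q} + \tilde{h}(1)\, w_{(o+1)/2,\,q} & \text{if } o = 3, 5, \ldots, 2^{j+1} N - 3 \\ \tilde{h}(0)\, w_{o/2,\,q} + \tilde{h}(2)\, w_{o/2+1,\,q} & \text{if } o = 4, 6, \ldots, 2^{j+1} N - 2 \\ \tilde{h}(-1)\, w_{2^j N - 1,\,q} + \tilde{h}(-1)\, w_{2^j N,\,q} + \tilde{h}(1)\, w_{2^j N,\,q} & \text{if } o = 2^{j+1} N - 1 \\ \tilde{h}(0)\, w_{2^j N,\,q} & \text{if } o = 2^{j+1} N \end{cases}$ (14)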

Since the binary patterns used in the experiments were all in two dimensions, the multiresolution
technique described in Section 2 cannot be applied to them directly. However, it has been shown that the
two-dimensional multiresolution transform can be seen as a one-dimensional multiresolution transform
along the x and y axes [6]. We first convolve the rows of binary patterns with a one-dimensional filter
$\tilde{H}$, retain every other row, convolve the columns of the resulting signals with another
one-dimensional filter and retain every other column. Hence, two sets of binary patterns can be collected with
size 16 x 16 and 8 x 8, and they are used as input vectors for networks $B_{2^{-1}}$ and $B_{2^{-2}}$
respectively. For all networks in each experiment, the sizes for the hidden layer and the output layer were
set to 15 and 10 respectively.

All of the experiments were run on a SPARCstation 10/30 with 32MB memory. Table 1 shows the
training results of the two experiments. The convergence time for each training job is presented, together
with a performance index: the ratio of the convergence time of the 1-level network to that of the job.
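
As a worked example of the performance index, the sketch below divides the 1-level convergence time by the convergence time of a job, using times reported in Table 1. The function name is an assumption made for this illustration.

```python
def performance_index(one_level_time, job_time):
    """Speed-up of a training job relative to the 1-level network."""
    return one_level_time / job_time

# Experiment 1: job 1 (3-level) against job 13 (1-level).
best = round(performance_index(7396.93, 662.52), 2)     # 11.16
# Experiment 2: job 12 (2-level) against job 13 (1-level).
least = round(performance_index(5472.72, 2936.73), 2)   # 1.86
```

An index above 1 means the multiresolution job converged faster than the single network trained directly on the original patterns.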


Table 1: Training results of the two experiments.

Job  Network                      Experiment 1               Experiment 2
no.  Type     $\rho$  $\delta$    Convergence  Performance   Convergence  Performance
                                  Time (sec)   Index         Time (sec)   Index
 1   3-level  0.8     0.0005        662.52     11.16         (illegible)  (illegible)
 2   3-level  0.8     0.001       (illegible)  (illegible)   (illegible)  (illegible)
 3   3-level  0.5     0.0005        955.62     (illegible)     932.52      5.87
 4   3-level  0.5     0.001        1424.62      5.19          1384.48      3.95
 5   3-level  0.2     0.0005       1274.13      5.81          1297.30      4.22
 6   3-level  0.2     0.001        1983.52      3.73          2008.87      2.72
 7   2-level  0.8     0.0005       2354.70      3.14          1780.05      3.07
 8   2-level  0.8     0.001        2841.17      2.60          2139.13      2.56
 9   2-level  0.5     0.0005       2986.17      2.48          2305.78      2.37
10   2-level  0.5     0.001        3190.02      2.32          2646.95      2.07
11   2-level  0.2     0.0005       2995.55      2.47          2542.80      2.15
12   2-level  0.2     0.001        3268.58      2.26          2936.73      1.86
13   1-level  -       -            7396.93      1.00          5472.72      1.00


Table 1 shows that the multiresolution learning method improves the training performance
of back-propagation networks significantly, from the least improvement of 1.86 times faster in
Experiment 2 to the best improvement of 11.16 times faster in Experiment 1. Generally, the
training performance increases as the network level increases, e.g., the convergence time for a 3-level
network is shorter than that for a 2-level network. As $\rho$ increases or $\delta$ decreases, the convergence
rate of the network also increases.

5  Discussion and Conclusion


First of all, let us investigate why the training performance of a back-propagation network is
improved when the multiresolution learning method is used. With the use of the low level networks
in the learning method, the training examples can be learned at a lower resolution. Since the
architecture of the low level networks is always simpler than that of the high level networks, the
low level networks often take less time in the training processes. Even though the examples cannot be
fully generalized at this level, the low level networks can reduce the overhead for the high level
networks to some extent. The function of the high level networks is to refine the generalization rather
than to start it from the beginning.

As shown in Table 1, the convergence rate increases as p increases or δ decreases. Such an
improvement is to be expected. In this case, the low level network, say $B_{2^j}$, contributes more
to the whole training process with a large value of p or a small value of δ, and is allowed to learn
as much of the information as possible. The main objective of the high level network $B_{2^{j+1}}$
is to learn the difference of information between $\{A_{2^{j+1}}x\}$ and $\{A_{2^j}x\}$. Usually, the
computational cost for $B_{2^j}$ is smaller than that of $B_{2^{j+1}}$.

In this paper, we proposed a problem-independent learning method for back-propagation networks, which
adopts multiresolution signal decomposition techniques, in order to alleviate the main shortcoming
of this kind of network, i.e., its slow convergence rate. Experimental results have shown that our
proposed learning method improves the training performance of back-propagation networks significantly.

ABSTRACT

The work is intended to motivate further research in the
application of neural networks to demodulation. An example is
provided of improved demodulation of bandwidth efficient
waveforms with a three layer neural network. Simulation results
of the probability of error with this demodulator are discussed.


The intent here is to tie together two important technological
areas, neural networks and demodulation in digital communications.
The paper is motivational in its goal. The result is
presented with the hope that this will spark further work in
this area.

At first glance the connection between neural networks and
digital communications seems obvious. In digital communications,
discrete time, quantized, information samples are represented by
or modulate individual waveforms. These waveforms are sent
through a transmission channel where they are usually disturbed
by noise and/or other interference and/or distortion caused by
dispersive phenomena or a variety of other deleterious effects.

At the receiving end the disturbed modulation waveform is
presented to a demodulator which attempts to extract, without
error, the information sample represented. Demodulation can be
viewed as detection or estimation of the information from the
received waveform. Alternatively, however, demodulation can also
be viewed as a simple pattern classification problem, with the
received, disturbed, modulation waveform being the pattern and
the information sample represented being the prototype. This is
a task well suited to a neural network.

Yet, despite the obvious connection there have been relatively
few reports of a neural network approach to demodulation. True,
the adaptive equalizer, which is really a demodulator, has been in
existence for several decades and is a neural network. But it
is a very primitive neural network, having only a single layer and
not really exploiting any nonlinearity. An examination of the
open literature has shown very little work in applying "modern,"
multilayer neural networks or Hopfield networks to the problem
of demodulation. References [1], [2] and [3] provide some
connection but hardly represent much attention from the community
as a whole.

How  then  to  begin?  It  would  seem  that  neural  networks  would 
provide  the  greatest  advantage  to  demodulation  when  the  channels 
themselves  are  nonlinear  and/or  non-Gaussian.  These  are 
channels  disturbed  by  intermodulation,  limiter  based  distortion, 
co-channel interference and dispersive effects. After all, the
nonlinearities present in neural networks may be brought to bear
on  this  communication  situation  in  the  same  way  they  are  brought 
to  bear  on  nonlinear  control  problems.  Furthermore,  the  existing 
techniques  for  demodulating  in  such  circumstances  are  far  from 
optimal.  However,  this  is  what  we  hope  to  motivate  and  is  beyond 
the  scope  of  the  present  paper. 

Rather, the problem is picked up by looking at the standard
Additive White Gaussian Noise (AWGN) channel and applying a
neural network to obtain greater bandwidth efficiency. AWGN
channels are linear. The issue of bandwidth efficiency
(modulation schemes which represent more bits per Hz) is itself
important as the information age explodes and the electromagnetic
spectrum is taxed to the limit in both cable and wireless
communications.

Let  us  begin  our  motivational  work  by  looking  at  the  problem  of 
binary  digital  communication  on  the  AWGN  channel.  Specifically, 
consider  the  situation  where  information  is  transmitted  using 
binary  orthogonal  signals.  In  particular  consider  the  set  of 
signals  illustrated  in  Figure  1.  Here,  the  upper  signal 
represents  the  binary  digit  "0"  and  the  lower  signal  represents 
the binary digit "1." A bit is transmitted every T seconds. If
T = 1/R and B is the modulation signal bandwidth, then the spectral
efficiency is R/B bits per Hz. The actual bandwidth varies with
the definition used. However, in any measure B varies inversely
with the signal "on-time," which in this example is T/2 seconds.
This signal set is, of course, a version of pulse position
modulation. In baseband communications it is often referred to
as Manchester encoding and preferred for its synchronization
capabilities.

In the AWGN channel this binary orthogonal modulation set is
optimally demodulated by a pair of matched filter correlators,
one matched to $s_0(t)$ and one matched to $s_1(t)$. The index of the
correlator output which is largest is the bit decision. The
probability of error versus Signal-to-Noise Ratio (SNR) resulting
from this optimal approach is available in many references (see
for example [4]) and indeed is the same for any pair of
orthogonal signals in these circumstances.
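
The correlator receiver described above can be sketched in a few lines. The example assumes rectangular, unit-amplitude versions of the Figure 1 signals sampled eight times per bit period; the sampling and the names `S0`, `S1` and `correlator_decision` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Assumed sampled versions of the Figure 1 signals: each is "on" for T/2.
S0 = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # binary digit "0"
S1 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # binary digit "1"


def correlator_decision(received, templates=(S0, S1)):
    """Correlate the received samples with each template and return the
    index of the largest correlator output as the bit decision."""
    outputs = [float(np.dot(received, template)) for template in templates]
    return int(np.argmax(outputs))


rng = np.random.default_rng(0)
noisy_one = S1 + 0.1 * rng.standard_normal(S1.shape)  # mildly disturbed "1"
print(correlator_decision(noisy_one))
```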

Consider now a slightly altered version of this waveform set,
namely the binary waveform set illustrated in Figure 2. Here,
the first modulation waveform has been extended to the right and
the second to the left, each by T/4, one quarter of a Baud
period. The on-time has been extended from T/2 to 3T/4.
Consequently, the bandwidth is reduced by approximately 33 1/3%.
This is a significant increase in spectral efficiency. Of course,
one could have gone to binary antipodal signalling (PSK) and gotten
a 100% increase. But we can say that this would "break the rules,"
which do not allow negative amplitude values.

The  waveforms  in  this  new  set  are  correlated.  They  are  no  longer 
orthogonal.  A  bank  of  matched  filter  correlators  is  no  longer 
the  optimal  demodulator.  How  should  these,  more  bandwidth 
efficient,  waveforms  be  demodulated?  One  could  do  nothing.  That 
is,  employ  the  now  sub-optimal  matched  filter  correlators.  In 
the  output  of  the  "incorrect"  matched  filter  correlator  there  is 
now cross-talk. In the present example this amounts to an
equivalent reduction of SNR by about 5 dB. This is a significant
penalty to get the increased spectral efficiency.
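
One way to see where a figure of about 5 dB could come from is the standard distance argument for equal-energy binary signals in AWGN: the error rate of a correlator receiver depends on 1 - ρ, where ρ is the normalized correlation between the two waveforms, so the equivalent SNR loss relative to orthogonal signals (ρ = 0) is -10 log10(1 - ρ). The sketch below assumes rectangular, unit-amplitude pulses sampled eight times per bit; the paper does not state how its figure was obtained, so this is only a plausibility check.

```python
import numpy as np

# Assumed sampled versions of the Figure 2 signals: each is "on" for 3T/4.
S0 = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
S1 = np.array([0, 0, 1, 1, 1, 1, 1, 1], dtype=float)


def correlation_coefficient(a, b):
    """Normalized inner product of two sampled waveforms."""
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))


rho = correlation_coefficient(S0, S1)       # overlap T/2 over on-time 3T/4
loss_db = -10.0 * np.log10(1.0 - rho)       # equivalent SNR loss vs. rho = 0
print(f"rho = {rho:.3f}, loss = {loss_db:.2f} dB")
```

Under these assumptions ρ = 2/3 and the loss is a little under 5 dB, which is consistent with the value quoted in the text.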

One could also resort to an improved but still sub-optimal
technique of somehow estimating the cross-talk and subtracting it
from the "incorrect" matched-filter correlator output. This is
like performing a Gram-Schmidt orthonormalization on the
modulation waveform set and using the resulting waveforms for
correlation. A technique employing this was reported in Ref. [5]
and is essentially a type of special equalizer. However,
estimation of the cross-talk itself involves noise. Hence, there
is an enhancement of the effective noise and a reduction in
equivalent SNR. This is always a phenomenon associated with
equalization. The penalty in reduced SNR is not as great as in the
case of doing nothing. The increased spectral efficiency does
come with some penalty but not anything like 5 dB.

We have investigated the use of a three (3) layer neural network
for demodulating the bandwidth efficient modulation waveform set
illustrated in Figure 2. But before describing the architecture
and results it is appropriate to ask why this is even worth
considering. That is, why is this an alternative to the
sub-optimal techniques just described, and an alternative worth
investigating? Those sub-optimal techniques were based on matched
filter correlation. Correlation is an exploitation of first
order statistics, moments. A two (2) layer neural network can
automatically do this. The addition of a third, hidden, layer
allows higher order statistics to be dealt with and features not
captured in correlation processing.

The three (3) level neural network architecture which we
investigated as a demodulator is illustrated in Figure 3. The
architecture consists of only 13 neurons. The input layer
captures eight (8) uniformly spaced samples of the received
waveform. These are designated as {S*(i), i = 1, ..., 8}. They are
uniformly spaced every T/8 seconds beginning at 0+. There is no
nonlinear processing in the input layer. The hidden layer
consists of four (4) neurons. To avoid an overly complicated
figure we have only shown a few of the branches with
corresponding weights. Each neuron in the hidden layer uses a
sigmoid as the activation with parameter 2. The neurons in this
layer are represented by circles with heavy black dots in the
center. The third or output layer is a single neuron with the
same type of sigmoid behavior. The output, Z, for purposes of
training is kept as a real number. In operation it is quantized
to either zero (0) or one (1) based upon nearest neighbor.
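
A minimal sketch of the forward pass of such an 8-4-1 network follows. It assumes that "parameter 2" denotes the gain (slope) of the sigmoid and that each neuron has a bias term; neither is stated explicitly in the paper, and the weights here are random placeholders rather than trained values.

```python
import numpy as np


def sigmoid(x, gain=2.0):
    """Logistic sigmoid; the gain of 2 is an assumed reading of "parameter 2"."""
    return 1.0 / (1.0 + np.exp(-gain * x))


def forward(samples, w_hidden, b_hidden, w_out, b_out):
    """8 input samples -> 4 sigmoid hidden neurons -> 1 sigmoid output Z."""
    hidden = sigmoid(w_hidden @ samples + b_hidden)
    return float(sigmoid(w_out @ hidden + b_out))


def decide(z):
    """In operation, quantize the real-valued output to the nearer of 0 and 1."""
    return 1 if z >= 0.5 else 0


rng = np.random.default_rng(0)
w_hidden, b_hidden = rng.normal(size=(4, 8)), np.zeros(4)  # placeholder weights
w_out, b_out = rng.normal(size=4), 0.0
z = forward(np.ones(8), w_hidden, b_hidden, w_out, b_out)
print(z, decide(z))
```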

This neural network was trained using back propagation. An error
acceptance parameter of 0.2 and a convergence parameter of 0.2
were used throughout training. All weights were initialized
randomly.

We examined the behavior of the network as a demodulator using
simulation. This was accomplished by fixing the SNR and fixing
the size of the training set. A specific training set of size N
(an even number) was generated as follows. The first two
waveforms in the set were the noise-free modulation waveforms,
$s_0(t)$ and $s_1(t)$. (N/2)-1 additional waveforms were then
generated by taking $s_0(t)$ and perturbing it by samples of
randomly generated AWGN at the SNR. The same number of additional
waveforms were generated by perturbing $s_1(t)$. For a
given SNR and training set we measured the probability of error
using 30,000 randomly generated testing waveforms. This
limitation arose because simulation was executed on a 386 PC with
an accelerator board.
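
The training-set construction just described can be sketched as follows. The paper does not define SNR at the sample level, so the code assumes SNR = (mean signal power per sample) / (noise variance); the sampled waveforms and the function name are also assumptions for illustration.

```python
import numpy as np

# Assumed 8-sample versions of the Figure 2 waveforms.
S0 = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
S1 = np.array([0, 0, 1, 1, 1, 1, 1, 1], dtype=float)


def make_training_set(s0, s1, n_total, snr_db, rng):
    """Return (waveforms, labels): the two noise-free waveforms followed by
    (N/2)-1 noisy copies of each, with AWGN at the requested SNR."""
    if n_total < 2 or n_total % 2:
        raise ValueError("n_total must be an even number >= 2")
    n_noisy = n_total // 2 - 1
    # Assumed SNR definition: mean signal power per sample over noise variance.
    signal_power = 0.5 * (np.mean(s0 ** 2) + np.mean(s1 ** 2))
    sigma = np.sqrt(signal_power / 10.0 ** (snr_db / 10.0))
    waveforms, labels = [s0.copy(), s1.copy()], [0, 1]
    for label, clean in ((0, s0), (1, s1)):
        for _ in range(n_noisy):
            waveforms.append(clean + sigma * rng.standard_normal(clean.shape))
            labels.append(label)
    return np.array(waveforms), np.array(labels)


X, y = make_training_set(S0, S1, n_total=10, snr_db=6.5,
                         rng=np.random.default_rng(0))
print(X.shape, y)
```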

The simulation results are shown in Figure 4. Each curve
corresponds to a different SNR and shows the variation of
simulated probability of error with the size of the training set.
Rather than make each curve a "smooth" interpolation of
simulated measurements we have preserved the actual randomness.
Nonetheless, we can still reach a firm conclusion which is our
essential result. It appears that in each case the probability
of error approaches that for binary orthogonal signalling with
increasing training set size. This is an interesting result. We
are able to get a significant improvement in bandwidth efficiency
with no penalty in SNR, provided training is sufficiently long.

What should be the next step? Before proceeding to more
complicated, nonlinear channels, one should look at the same
problem of bandwidth efficiency but with larger modulation
waveform sets. In particular, it would be nice to derive, at
least by simulation, the equivalent of a Shannon Coding Theorem
for neural networks. That is, what degree of bandwidth
efficiency can be obtained while still having an orthogonal
signalling probability of error?

Figure 1: Example set of binary orthogonal signals

Figure 2: Binary modulation waveform set which is not orthogonal

Figure 4
a. Simulated Measured Probability of Error vs Number of Training Waveforms
b. Simulated Measured Probability of Error vs Number of Training Waveforms for SNR = 6.5 dB
c. Simulated Measured Probability of Error vs Number of Training Waveforms

Abstract

When a conventional BP ANN is employed for image encoding, the decoded images usually exhibit
some degradation of the edges. This is due to the fact that edge pixels usually represent a small
portion of the entire image and BP learning algorithms do not differentiate between edge and
non-edge pixels. In this paper, a novel Edge-Preserving ANN learning algorithm is proposed. This
learning algorithm pays more attention to edge information. The error between the computed and
desired output value is multiplied by a weighting factor which is proportional to the amount of edge
information in the corresponding input pixel. The algorithm is implemented and its performance
is assessed by comparing it to the conventional BP.

1  Introduction 

Successful image compression using back propagation neural networks has been demonstrated in
[1] [2]. In these systems the decoded image usually suffers from degraded edges. The human visual
system contains special cells in the brain which are very sensitive to edges [3]. This suggests that
in order to obtain a good quality image, edge information has to be preserved.

ANNs are mathematical models of theorised mind and brain activity [4]. Typically, these models
differ in their topology, way of learning, and way of recalling information [5] [6] [7]. For instance,
the BP ANN [8], which is a supervised, feedforward network, learns by making weight connection
adjustments according to the error between the computed and desired output values. All output
values receive the same attention regardless of the information they represent. However, in order
to avoid edge degradation, it would be desirable for edge pixels to receive more emphasis (while
learning) than the rest of the pixels.

In the next section, a novel ANN learning algorithm which preserves edge information is proposed.
To justify this proposed learning algorithm, simulation experiments were carried out. These
experiments and results are described in section 3, a discussion of the results is presented in section
4, and the paper is summarised in the concluding section 5.

2  Edge-Preserving  ANN  model 

In this model, the error between the computed and desired output value is multiplied by a weighting
factor which is proportional to the amount of edge information in the corresponding input pixel.
This proposed weighting factor can be calculated by using the Laplacian operator, which is a
second-order derivative operator and can be implemented by convolving the mask shown in Figure 1
with the image. Then, the absolute value of the Laplacian of each pixel is normalised by dividing
it by the maximum absolute Laplacian value over all pixels of the image. Finally, a value of 1 is
added to each pixel of the absolute normalised Laplacian image to get a weighting factor for each
image pixel between 1 and 2. The purpose of the added 1 is to maintain the effect a non-edge pixel
has in adjusting ANN weights. So, when the proposed weighting factor is close to 1 for a pixel that
has almost no edge information, its effect on changing ANN weights is the same as in the conventional
BP learning algorithm. On the other hand, when it is close to 2 for a pixel that represents part of an
edge, its effect on ANN weights is almost double what it would be using the conventional
BP algorithm. Figure 2 (a) shows the proposed Laplacian edge-effect enhanced learning.

Figure 1: Mask used to compute the Laplacian

Correspondence should be addressed to: iDkainelOwstiiow.iiwote(Ioo.co (e-mail) and/or (519) 888-4567 ext. 5761
(phone).
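
The weighting-factor computation described in this section can be sketched as follows. The mask of Figure 1 is not reproduced in this text, so the code assumes the common 4-neighbour Laplacian mask; the border handling (edge replication) and the function names are likewise assumptions.

```python
import numpy as np

# Assumed 4-neighbour Laplacian mask (the paper's Figure 1 mask may differ).
LAPLACIAN_MASK = np.array([[0, 1, 0],
                           [1, -4, 1],
                           [0, 1, 0]], dtype=float)


def edge_weights(image, mask=LAPLACIAN_MASK):
    """Per-pixel weighting factor in [1, 2]: 1 + |Laplacian| / max|Laplacian|."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, 1, mode="edge")  # assumed border handling
    lap = np.zeros_like(img)
    rows, cols = img.shape
    # Sliding-window sum; identical to convolution because the mask is symmetric.
    for dy in range(3):
        for dx in range(3):
            lap += mask[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    magnitude = np.abs(lap)
    peak = magnitude.max()
    if peak > 0:
        magnitude = magnitude / peak
    return 1.0 + magnitude


def weighted_error(desired, computed, weights):
    """Output error scaled by the edge weighting factor of each pixel."""
    return weights * (desired - computed)


step = np.zeros((6, 6))
step[:, 3:] = 255.0          # vertical edge between columns 2 and 3
print(edge_weights(step)[0])
```

Pixels far from the edge receive a weight of 1, so their contribution to the weight update is the same as in conventional BP, while pixels on the edge receive a weight of up to 2.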

After learning, an image block is compressed by forward propagating it through the network,
then quantising and saving the hidden layer unit activations instead of saving the original block
pixel values. To uncompress an image block, these quantised values are presented to the second
half of the network, and the reconstructed block is generated from the output layer unit activations.
Figure 2 (b) shows a diagram of the compression and decompression operations.

Figure 2: The proposed learning: (a) Proposed Laplacian edge-effect enhanced learning, (b)
The compression and decompression operations.
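
The compression and decompression steps can be illustrated with a short sketch. It assumes sigmoid units whose activations lie in [0, 1], uniform quantisation of the hidden activations, and pixel values scaled to [0, 1]; the weights are random placeholders, and all names are hypothetical rather than taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def quantise(activations, bits):
    """Map activations in [0, 1] to integer codes 0 .. 2**bits - 1."""
    levels = 2 ** bits - 1
    return np.round(np.clip(activations, 0.0, 1.0) * levels).astype(int)


def dequantise(codes, bits):
    return codes / (2 ** bits - 1)


def compress(block, w_enc, b_enc, bits):
    """First half of the network: 16 pixels -> 4 quantised hidden activations."""
    return quantise(sigmoid(w_enc @ block + b_enc), bits)


def decompress(codes, w_dec, b_dec, bits):
    """Second half of the network: 4 codes -> 16 reconstructed pixels."""
    return sigmoid(w_dec @ dequantise(codes, bits) + b_dec)


rng = np.random.default_rng(1)
w_enc, b_enc = rng.normal(size=(4, 16)), np.zeros(4)    # placeholder weights
w_dec, b_dec = rng.normal(size=(16, 4)), np.zeros(16)
block = rng.random(16)                                  # a 4 by 4 block, flattened
codes = compress(block, w_enc, b_enc, bits=6)
restored = decompress(codes, w_dec, b_dec, bits=6)
print(codes, restored.shape)
```

With 6-bit quantisation a 16-pixel block is stored as four 6-bit codes.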


3 Experimental Results

The main purpose of this experiment is to test the proposed learning algorithm and compare its
performance with the conventional BP learning algorithm. In this experiment, two three-layer BP
ANNs were used. Each ANN consists of 16 input units, 4 hidden units, and 16 output units. The
only difference between these two ANNs is their learning algorithm. In the first ANN the proposed
modified learning is employed, while the conventional BP learning is employed in the second ANN.
The input to each of these two ANNs is 16 values representing the pixel values of a 4 by 4 image block.

The learning rate and the number of learning iterations are selected to be identical in both ANNs.
The learning data set is generated from the terminal image, shown in Figure 3 (a). The testing
data set is generated from the hotel image, shown in Figure 3 (b).


Figure 3: (a) Learning image, (b) Testing image.


Figure 4 (a), (b), (c), and (d) show the reconstructed learned image and the absolute error
after performing compression / decompression using both of the ANN learning algorithms. Six bits
were used to quantise the activation of each hidden unit. The absolute error image is enhanced by
adding a value of 128 to each pixel value. Figure 4 (e) shows a table of the mean squared error of
the reconstructed learned image for both learning algorithms. This table demonstrates the effect
of using various quantisation levels.

Figure 5 (a) and Figure 5 (b) show a zoomed section of the reconstructed testing image after
performing compression / decompression using both of the ANN learning algorithms with six-bit
quantisation. Figure 5 (c) shows a table of the mean squared error of the reconstructed testing
image for both learning algorithms.

4  Discussion 

The proposed Laplacian edge-effect enhanced learning shows better performance in the form of a lower
mean squared error, and also appears to produce a better quality reconstructed image. When
Laplacian edge-effect enhanced learning is employed the edges are preserved more than when regular
BP learning is employed. This is because the proposed learning algorithm gives more importance
to the learning of edge information in the image.

Quantisation bits    8             7        6        5        4        3        2        1
Proposed learning    [illegible]   47.18    48.21    53.10    79.36    174.80   585.17   2473.30
BP learning          48.95         49.16    50.71    57.79    87.80    212.28   695.70   2428.47

(e)

Figure 4: (a) and (b) Reconstructed taught image after performing compression / decompression
using the proposed learning algorithm and the BP learning algorithm respectively with 6 bits
quantisation, (c) and (d) absolute error in image (a), (b) respectively, and (e) Reconstruction mean
squared error for different quantisation levels.

Quantisation bits    8        7        6        5        4        3        2        1
Proposed learning    112.82   113.05   114.36   119.53   144.33   239.35   527.14   1861.18
BP learning          135.31   135.51   136.31   140.1    156.82   255.51   591.37   1456.55

(c)

Figure 5: (a) and (b) Zoom on a part of the reconstructed testing image after performing
compression / decompression using the proposed learning algorithm and the BP learning algorithm
respectively with 6 bits quantisation, (c) Reconstruction mean squared error for different
quantisation levels.

As shown in Figure 4 (e) and Figure 5 (c), the error for the proposed Laplacian edge-effect enhanced
learning is consistently lower than that obtained using regular BP learning for any number of
quantisation bits used, except with 1-bit quantisation. However, in the latter case, it does not
really matter which method is better, since fidelity would be extremely low for both due to the high
mean squared error.

Otherwise, the error increases slightly with decreasing quantisation level for the hidden layer
unit activations. This is true down to a certain level of quantisation (4 bits in this case), after
which the error increases dramatically. Below this quantisation and with the current learning level,
the network is no longer able to encode the image successfully.

5  Conclusion 

Trained ANNs are able to extract the desired information in a given image and encode it according
to a predetermined criterion. This criterion is given to the ANNs during learning in the form
of training examples and the learning algorithm. Edge degradation during the compression /
decompression processes can be reduced by adapting the ANN learning algorithm so that it can
pay more attention to edge information. Potentially, this method could be used to emphasise other
image features such as texture.

Abstract This paper proposes a neural network that learns to recover the original random signals
from  their  linear  mixtures  observed  by  the  same  number  of  sensors.  The  network  acquires  the 
function  without  using  any  information  about  the  statistical  properties  of  the  sources  and  the 
coefficients  of  the  linear  transformation,  except  the  assumption  that  the  source  signals  are  statistically 
independent  and  nonstationary.  The  learning  rule  is  formulated  as  a  steepest  descent  minimization  of 
a  time-dependent  cost  function  that  takes  the  minimum  only  when  the  network  outputs  are 
uncorrelated  with  each  other. 

1.  Introduction 

This  work  deals  with  the  problem  of  how  the  original  signals  generated  by  some  stochastic 
sources  (e.g.,  voices  uttered  by  two  persons)  can  be  separated  from  their  linear  mixtures  observed  by 
the  same  number  of  sensors  (e.g.,  output  voltages  of  two  microphones).  Such  a  signal  separation  is 
called  "blind  separation",  when  it  must  be  performed  in  the  absence  of  any  special  information  about 
the  statistical  properties  of  the  sources  and  the  coefficients  of  the  linear  transformation,  except  the  fact 
that  the  source  signals  are  statistically  independent  of  each  other. 

It  can  be  shown  that  the  blind  separation  is  impossible  if  the  sources  are  stationary,  gaussian 
processes.  The  method  proposed  here  assumes  that  the  source  signals  are  nonstationary,  while  the 
conventional methods stipulate that they are nongaussian [1, 2, 3, 4, 5].

This  paper  proposes  an  adaptive  linear  network  which  acquires  the  function  of  blind 
separation.  It  is  achieved  by  iteratively  modifying  the  network’s  parameters  so  as  to  minimize  a  time- 
dependent  cost  function  that  takes  the  minimum  only  when  the  network  outputs  are  uncorrelated  with 
each  other.  It  is  shown  that  the  equilibrium  of  the  learning  dynamics  is  uniformly  asymptotically 
stable.  A  computer  simulation  is  also  given  to  demonstrate  the  validity  of  the  method. 

2.  Signal  Sources 

Suppose that random signals $x'_j(t)$ ($j = 1, \ldots, N$) are generated by N statistically independent
sources, and their linear mixtures (affine transformation) $s'_i(t)$ ($i = 1, \ldots, N$) are observed by N
sensors:

$$s'_i(t) = \sum_{j=1}^{N} a_{ij}\, x'_j(t) + a_i \qquad (1)$$

where $a_{ij}$ and $a_i$ are constants independent of time t. Putting $x_j(t) = x'_j(t) - \langle x'_j(t) \rangle$ and
$s_i(t) = s'_i(t) - \langle s'_i(t) \rangle$ ($\langle * \rangle$ denotes the ensemble average of *), (1) can be rewritten as

$$s_i(t) = \sum_{j=1}^{N} a_{ij}\, x_j(t) \qquad (2)$$

We here assume that $\langle x'_j(t) \rangle$ ($j = 1, \ldots, N$) are constant with time, implying that
$\langle s'_i(t) \rangle$ ($i = 1, \ldots, N$) are also constant. Then, $s_i(t)$ can also be considered an observable
signal because $\langle s'_i(t) \rangle$ can easily be estimated by a time average of $s'_i(t)$. Henceforth, we
call $x_j(t)$ and $s_i(t)$ the source signal and the sensor signal, respectively. (2) can be expressed in
vector notation as

$$s(t) = A\,x(t) \qquad (3)$$

where $s(t) = [s_1(t), \ldots, s_N(t)]^T$, $x(t) = [x_1(t), \ldots, x_N(t)]^T$, and $A = [a_{ij}]$.

The objective of this paper is to propose a neural network that learns to recover the original
signals $x_j(t)$ ($j = 1, \ldots, N$) from the sensor signals $s_i(t)$ ($i = 1, \ldots, N$) in the absence of any
special information about the properties of A and x(t). It should be noted, however, that this
definition of signal separation has an ambiguity. Namely, if $x_j(t)$ ($j = 1, \ldots, N$) are source signals,
then $d_1 x_{p_1}(t), \ldots, d_N x_{p_N}(t)$ can also be considered as source signals, where $\{p_1, \ldots, p_N\}$
is an arbitrary permutation of $\{1, \ldots, N\}$ and $d_1, \ldots, d_N$ are arbitrary nonzero constants. This is
because $d_1 x_{p_1}(t), \ldots, d_N x_{p_N}(t)$ are also statistically independent and $s_i(t)$ can be expressed
by their linear combination with coefficients $a_{i p_1}/d_1, \ldots, a_{i p_N}/d_N$. Henceforth, we therefore
define the signal separation as a process providing any of the following type of signals:

$$X(t) = D\,P\,x(t) \qquad (4)$$

where P is a permutation matrix and D a diagonal matrix with nonzero diagonal elements.
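
The mixing model (3) and the ambiguity expressed by (4) can be checked numerically. The sketch below uses two hypothetical nonstationary sources (independent Gaussian noise with different slowly varying variances) and an arbitrary mixing matrix; none of these specific values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_samples = 2, 1000
t = np.arange(n_samples)

# Hypothetical nonstationary sources: zero-mean noise with time-varying variance.
envelopes = np.stack([1.0 + 0.8 * np.sin(2 * np.pi * t / 200),
                      1.0 + 0.8 * np.cos(2 * np.pi * t / 330)])
x = envelopes * rng.standard_normal((n_sources, n_samples))

A = np.array([[1.0, 0.6],
              [0.4, 1.0]])          # arbitrary nonsingular mixing matrix
s = A @ x                           # sensor signals, Eq. (3)

# Ambiguity of Eq. (4): permuted and rescaled sources explain the same sensors.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])
x_alt = D @ P @ x                   # alternative "sources" X(t) = D P x(t)
A_alt = A @ np.linalg.inv(D @ P)    # matching alternative mixing matrix
s_alt = A_alt @ x_alt

print(np.allclose(s, s_alt))
```

Because the sensor signals are identical in both descriptions, no method that observes only s(t) can distinguish x(t) from DPx(t), which is why separation is defined only up to this transformation.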

The assumptions we need for blind separation are very modest ones, as follows.

Assumption 1 Matrix A is nonsingular.

Assumption 2 $x_j(t)$ ($j = 1, \ldots, N$) are statistically independent with zero mean.

This implies that the covariance matrix R(t) of x(t) is a diagonal matrix:

$$R(t) = \mathrm{diag}\{r_1(t), \ldots, r_N(t)\} = \mathrm{diag}\{\langle x_1^2(t) \rangle, \ldots, \langle x_N^2(t) \rangle\} \qquad (5)$$

where diag{...} represents a diagonal matrix with diagonal elements {...}.

Assumption 3 $r_i(t)$ ($i = 1, \ldots, N$) are linearly independent functions.

Namely, the following equation holds only when $c_i = 0$ ($i = 1, \ldots, N$).

$$\sum_{i=1}^{N} c_i\, r_i(t) = 0 \qquad (6)$$

Here, time-varying functions $r_i(t)$ ($i = 1, \ldots, N$) are considered to be defined in the time interval during
which the sensor signals are observed. [We shall stipulate a slightly stronger condition in §4.]

The last assumption is important because Assumptions 1 and 2 are not sufficient to realize the
signal separation in general; if, for example, x(t) is a stationary, gaussian process, then signal
separation is essentially impossible (see §6). Under Assumptions 1-3 we can prove that, if the same
sensor signal s(t) is produced by source signal X(t) as well as by x(t), then the relation (4) must hold.
This means that, regarding the definition of the source signals, we need not take into account any
ambiguity other than the one mentioned above.

3.  Signal  Separation  Network 

For signal separation we consider a recurrent network shown in Fig. 1, which receives sensor
signals $s_i(t)$ ($i = 1, \ldots, N$) as input and produces outputs $y_i(t)$ ($i = 1, \ldots, N$). The dynamics of each
output unit is given by the following first-order linear differential equation

$$\tau \frac{dy_i(t)}{dt} + y_i(t) = s_i(t) - \sum_{j=1}^{N} c_{ij}\, y_j(t) \quad (i = 1, \ldots, N) \qquad (7)$$

Here, $-c_{ij}$ ($i, j = 1, \ldots, N$; $i \neq j$) represent the strengths of the mutual connections between the
output units, and they change slowly according to an adaptation rule which will be shown later. The
output units have no self connection; $c_{ii} = 0$. (7) is expressed in vector notation as

$$\tau \frac{dy(t)}{dt} + y(t) = s(t) - C\,y(t) \qquad (8)$$

where $y(t) = [y_1(t), \ldots, y_N(t)]^T$, $C = [c_{ij}]$, and $\tau$ is a time constant. If $-(I + C)$ is a stable
matrix and the time constant $\tau$ is sufficiently small, then the network function (8) can be replaced
by the following static input-output relation, which will be assumed in the sequel.

$$y(t) = (I + C)^{-1} s(t) \qquad (9)$$


Fig. 1  Signal separation network


Our objective is now to determine C such that $y_i(t)$ is proportional to $x_{p_i}(t)$, where $\{p_1,\ldots,p_N\}$ is a permutation of $\{1,\ldots,N\}$. We call C so determined a signal separation operator. The general form of the signal separation operator is $C = APD - I$, where P is an arbitrary permutation matrix and D is an arbitrary diagonal matrix with nonzero diagonal elements. This is confirmed by $y(t) = (C+I)^{-1}s(t) = D^{-1}P^{-1}A^{-1}Ax(t) = D^{-1}P^{-1}x(t)$, which is essentially equivalent to (4). Moreover, the constraint $c_{ii}=0$ leads to $D = [\mathrm{diag}\{AP\}]^{-1}$ (diag{*} denotes the matrix made by setting the nondiagonal elements of matrix * to zero). Thus, the general form of the signal separation operator under the condition $c_{ii}=0$ is given by

$$C = AP[\mathrm{diag}\{AP\}]^{-1} - I \tag{10}$$
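
Relation (10) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `separation_operator` and the list-based encoding of the permutation are illustrative choices, not notation from the paper. It builds C for a given mixing matrix A and permutation, so that one can confirm the zero diagonal and that the static relation (9) returns a scaled, permuted copy of the sources.

```python
import numpy as np

def separation_operator(A, perm):
    """Return (C, P) with C = A P [diag{A P}]^{-1} - I, as in (10).

    `perm` is a hypothetical encoding of the permutation: column k of P
    is the unit vector e_{perm[k]}.
    """
    n = A.shape[0]
    P = np.eye(n)[:, perm]
    AP = A @ P
    diag_AP = np.diag(np.diag(AP))          # diag{AP}: off-diagonal entries zeroed
    C = AP @ np.linalg.inv(diag_AP) - np.eye(n)
    return C, P

A = np.array([[1.0, 0.5], [0.5, 1.0]])
C, P = separation_operator(A, [1, 0])       # swap the two sources
x = np.array([2.0, -3.0])                   # arbitrary source sample
y = np.linalg.solve(np.eye(2) + C, A @ x)   # static relation (9)
print(C)                                    # diagonal entries are zero
print(y)                                    # each y_i is a scaled copy of x_{perm[i]}
```

For this choice the outputs are the sources in swapped order, each scaled by the corresponding diagonal element of AP.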

This  result,  however,  cannot  be  used  for  constructing  the  signal  separation  network,  because 
we  are  assuming  that  matrix  A  is  unknown.  Here,  we  give  the  following  theorem  which  is  useful  to 
obtain  the  signal  separation  operator  by  a  learning  process. 


Theorem 1  The following three statements are equivalent.

(i) C is given by (10).

(ii) $\langle y_i(t)y_j(t)\rangle$ (i,j=1,...,N; i≠j) are zero at any time t.

(iii) The following nonnegative scalar function takes zero at any time t.

$$Q(C,R(t)) = \frac{1}{2}\Big\{ \sum_{i=1}^{N} \log \langle y_i^2(t)\rangle - \log \big|\langle y(t)y(t)^T\rangle\big| \Big\} \tag{11}$$
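
The criterion (11) depends on the outputs only through their covariance matrix, so it is easy to evaluate for a given $\langle y(t)y(t)^T\rangle$. The sketch below assumes NumPy and a positive-definite covariance; the function name `Q` simply mirrors the symbol in (11). Nonnegativity follows from Hadamard's inequality (the determinant of a positive-definite matrix never exceeds the product of its diagonal entries), with equality exactly when the matrix is diagonal, which is statement (ii).

```python
import numpy as np

def Q(cov):
    """Evaluate criterion (11) on a positive-definite covariance <y y^T>."""
    _, logdet = np.linalg.slogdet(cov)           # log|<y y^T>|
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

print(Q(np.diag([2.0, 0.3])))                    # uncorrelated outputs: Q is zero
print(Q(np.array([[1.0, 0.5], [0.5, 1.0]])))     # correlated outputs: Q is positive
```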


4.  Learning  Process 

From Theorem 1 it is found that the signal separation can be realized by determining C so that Q(C, R(t)) will take the minimum value at any time. In order to achieve this we consider the following dynamics

$$T\frac{dc_{ij}}{dt} = -\frac{\partial Q(C,R(t))}{\partial c_{ij}} \quad (i,j=1,\ldots,N;\ i\neq j) \tag{12}$$

By calculating the derivative in the right-hand side we have

$$T\frac{dC}{dt} = (I + C^T)^{-1}\big\{(\mathrm{diag}\langle y(t)y(t)^T\rangle)^{-1}\langle y(t)y(t)^T\rangle - I\big\} \tag{13}$$


Note that the diagonal elements in both sides of this equation ($T\,dc_{ii}/dt = \ldots$) are immaterial: the $c_{ii}$ are always zero. It can be seen that (12) or (13) is to attain the minimum of Q(C, R(t)) by a steepest descent method. The behavior of this learning dynamics, however, is not so simple because Q(C, R(t)) includes a time-varying function R(t), but we can prove the following theorem.


Theorem 2  Every equilibrium of equation (13) takes the form of (10), and it is (locally) uniformly asymptotically stable under the following condition. [Equilibrium defined here indicates the state at which C does not change at any time.]


Assumption 3'  For some $t_0$, $T_0$ (>0), and $\varepsilon$ (>0), the following inequality holds for any unit vector $[q_1, q_2]$ ($q_1^2+q_2^2 = 1$) and for any time t ($\geq t_0$).

$$\int_{t}^{t+T_0} [\text{integrand illegible in source}]\, dt > \varepsilon > 0 \tag{14}$$


This condition is easily attained if $r_i(t)$ (i=1,...,N) continue to fluctuate somewhat independently.

In order to actually realize (13), we need to estimate $\langle y(t)y(t)^T\rangle$ in real time. To this end we can use the following moving average $\Phi = [\phi_{ij}]$ of $y(t)y(t)^T$ under a certain condition.

$$T'\frac{d\Phi(t)}{dt} + \Phi(t) = y(t)y(t)^T \tag{15}$$

So, (13) becomes

$$T\frac{dC}{dt} = (I + C^T)^{-1}\big\{(\mathrm{diag}\,\Phi(t))^{-1}\Phi(t) - I\big\} \tag{16}$$


5.  A  Special  Case:  N=2 


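Here, we shall consider the special case of N=2, for which (13) reads

$$T\frac{dc_{12}}{dt} = \frac{1}{1-c_{12}c_{21}}\,\frac{\langle y_1(t)y_2(t)\rangle}{\langle y_1^2(t)\rangle}, \qquad T\frac{dc_{21}}{dt} = \frac{1}{1-c_{12}c_{21}}\,\frac{\langle y_2(t)y_1(t)\rangle}{\langle y_2^2(t)\rangle} \tag{17}$$

According to (10), it has two equilibria given by

$$c_{12} = \frac{a_{12}}{a_{22}},\ c_{21} = \frac{a_{21}}{a_{11}} \ \left(P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \quad \text{or} \quad c_{12} = \frac{a_{11}}{a_{21}},\ c_{21} = \frac{a_{22}}{a_{12}} \ \left(P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\right) \tag{18}$$

According to Theorem 2, both equilibria are stable in terms of the learning dynamics (13), but it should be noted that the network dynamics (7) must also be stable at the equilibrium. It can be shown that only one of these satisfies the condition $|c_{12}c_{21}| < 1$, which allows the network dynamics to be stable.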

We next consider a simplified learning dynamics obtained by eliminating the common term $1/(1-c_{12}c_{21})$ appearing in the right-hand sides of (17)

$$T\frac{dc_{12}}{dt} = \frac{\langle y_1(t)y_2(t)\rangle}{\langle y_1^2(t)\rangle}, \qquad T\frac{dc_{21}}{dt} = \frac{\langle y_2(t)y_1(t)\rangle}{\langle y_2^2(t)\rangle} \tag{19}$$

An interesting feature of this learning rule is that it is a variant of an anti-Hebbian rule [6]; the connection weight $-c_{ij}$ decreases proportionally to the product of $y_i(t)$ and $y_j(t)$ but with time-varying rate $1/\langle y_i^2(t)\rangle$. Obviously, (19) has the same equilibria (18) as the original dynamics (17), but the stability becomes a little different; only the equilibrium satisfying $|c_{12}c_{21}| < 1$ is stable. Namely, only the equilibrium for which the network dynamics is stable is stable also with respect to the learning dynamics.

We here show a computer simulation, in which the differential equations previously given are transformed into difference equations, using the Euler approximation. For source signals $x_1(k)$ and $x_2(k)$, the following stationary and nonstationary gaussian white signals are used, respectively:

$$x_1(k) = u_1(k), \qquad x_2(k) = \eta(k)u_2(k)$$

where $u_1(k)$ and $u_2(k)$ are both gaussian white signals with zero mean and unit variance, and $\eta(k)$ is given by $\eta(k) = 3\sin((\pi/200)k)$. Matrix A is given as

$$A = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$$

(15) and (17) become, respectively,

$$\phi_{ij}(k+1) = \alpha\phi_{ij}(k) + (1-\alpha)y_i(k)y_j(k) \quad (i,j = 1,2)$$

$$c_{12}(k+1) = c_{12}(k) + \beta\frac{\phi_{12}(k)}{\phi_{11}(k)}, \qquad c_{21}(k+1) = c_{21}(k) + \beta\frac{\phi_{21}(k)}{\phi_{22}(k)}$$

The values of $y_i(k)$ are assumed to be given by the static input-output relation (9). The parameters of the learning dynamics are chosen as $\alpha = 0.9$ and $\beta = 0.001$, and the initial values of $c_{12}$, $c_{21}$, and $\phi_{ii}$ (i=1,2) are set at 0, 0, and 1, respectively.

Fig. 2 shows the plots of $x_i(k)$ and $y_i(k)-x_i(k)$ (i=1,2). Theoretically, the network should learn to provide the output as $y_i(k) = x_i(k)$ (i=1,2). One can see that the network acquires the desired function in about a thousand steps.
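
The experiment can be sketched in a few lines of NumPy. This is only an illustrative reconstruction under stated assumptions: the function name `simulate`, the random seed, and the initial value $\phi_{12} = 0$ are not given in the text, and the coefficient update uses the simplified rule without the $1/(1-c_{12}c_{21})$ factor, with $\phi(k)$ used before it is refreshed. With A as above, the stable equilibrium predicted by (18) is $c_{12} = c_{21} = 0.5$.

```python
import numpy as np

def simulate(steps=10000, alpha=0.9, beta=0.001, seed=0):
    """Euler-discretized learning for N=2; returns the history of (c12, c21)."""
    rng = np.random.default_rng(seed)        # seed is an arbitrary choice
    A = np.array([[1.0, 0.5], [0.5, 1.0]])
    c12, c21 = 0.0, 0.0
    phi = np.eye(2)                          # phi_ii = 1; phi_12 = 0 is an assumption
    history = np.empty((steps, 2))
    for k in range(steps):
        eta = 3.0 * np.sin(np.pi * k / 200.0)
        x = np.array([rng.standard_normal(), eta * rng.standard_normal()])
        s = A @ x                            # sensor signals
        C = np.array([[0.0, c12], [c21, 0.0]])
        y = np.linalg.solve(np.eye(2) + C, s)    # static relation (9)
        # Coefficient update uses phi(k), then the moving average is refreshed.
        c12 += beta * phi[0, 1] / phi[0, 0]
        c21 += beta * phi[1, 0] / phi[1, 1]
        phi = alpha * phi + (1.0 - alpha) * np.outer(y, y)
        history[k] = (c12, c21)
    return history

history = simulate()
print(history[-5000:].mean(axis=0))          # time-averaged coefficients late in the run
```

Because the update is driven by noisy ten-sample averages, the coefficients keep fluctuating around the equilibrium; averaging the late part of the run gives a steadier estimate.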


Fig. 2  The plots of $x_i(k)$ and $y_i(k)-x_i(k)$ (horizontal axis: steps)

6.  Discussion 

Suppose that x(t) in (3) obeys a stationary gaussian process with constant covariance matrix $R = \mathrm{diag}\{r_1,\ldots,r_N\}$. Then, we can see that the source signal $\bar{x}(t)=[\bar{x}_1(t),\ldots,\bar{x}_N(t)]^T$ and the linear transform $\bar{A}$ given below yield the same sensor signal s(t),

$$\bar{x}(t) = DE^TR^{-1/2}x(t) \tag{20}$$

$$s(t) = \bar{A}\bar{x}(t) = AR^{1/2}ED^{-1}\bar{x}(t) \tag{21}$$

where D is a diagonal matrix, E an orthogonal matrix, and $R^{1/2} = \mathrm{diag}\{r_1^{1/2},\ldots,r_N^{1/2}\}$. Note that $\langle\bar{x}(t)\bar{x}(t)^T\rangle = D^2$ is a diagonal matrix, i.e., $\bar{x}_i(t)$ (i=1,...,N) are independent of each other because noncorrelation is equivalent to independence in gaussian processes. The arbitrariness of E implies that it is essentially impossible to recover x(t) from s(t) to within the ambiguity of (4). This fact means that the C minimizing Q(C, R) is not an isolated point but forms a hypersurface in the N(N-1)-dimensional space of the $c_{ij}$ (i≠j). So, as learning proceeds, C approaches some point on the surface, depending on its initial value.

If R(t) changes with time, the situation becomes different. Let us, for example, consider the case that R(t) (N=2) takes $R_1$ and $R_2$ alternately (see Fig. 3). When $R(t)=R_1$, C moves toward the curve determined by $Q(C,R_1)=0$, and when $R(t)=R_2$, C moves toward the curve $Q(C,R_2)=0$. As a result, C converges to a cross point of the two curves, i.e., the desired equilibrium (10).


Fig.  3  Trajectory  of  C  in  the  case  that  R(t)  takes  two  values  alternately 
7.  Conclusion 

We  have  described  a  neural  network  that  self-organizes  to  recover  the  original  signals  from  the 
sensor  signals.  It  is  performed  without  any  particular  information  about  the  statistical  properties  of 
the  sources  and  the  coefficients  of  the  linear  transformation,  except  the  fact  that  the  source  signals  are 
statistically  independent,  nonstationary  signals. 

Acknowledgement  The  authors  thank  Dr.  S.  Kurogi  and  Dr.  M.  Ohya  for  giving  useful 
suggestions.  This  work  was  supported  by  the  Grant-in-Aids  for  the  Scientific  Research  by  the 
Ministry  of  Education,  Science  and  Culture  of  Japan,  No.  05680301 . 

References 

[1] Comon, P. "Separation of stochastic processes whose linear mixture is observed", Proc. ONR-NSF-IEEE Workshop on Higher-Order Spectral Analysis, Vail, Colorado, 1989, pp. 174-179.

[2] Jutten, C. & Herault, J. "Blind separation of sources. Part I: An adaptive algorithm based on a neuromimetic architecture", Signal Processing, 24(1), 1991, pp. 1-10.

[3] Comon, P., Jutten, C., & Herault, J. "Blind separation of sources. Part II: Problem statement", Signal Processing, 24(1), 1991, pp. 11-20.

[4] Sorouchyari, E. "Blind separation of sources. Part III: Stability analysis", Signal Processing, 24(1), 1991, pp. 21-29.

[5] Burel, G. "Blind separation of sources: a nonlinear neural algorithm", Neural Networks, 5, 1992, pp. 937-947.

[6] Matsuoka, K. & Kawamoto, M. "A neural network that self-organizes to perform three operations related to principal component analysis", Neural Networks, to appear.




STEFANIA MARCHINI* and N. ALBERTO BORGHESE**

*Dipartimento di Scienze dell'Informazione, University of Milano, I
**Institute Neuroscience Bioimages C.N.R., Via Mario Bianco 9, 20131 Milano, I

ABSTRACT 

Radial Basis Functions have been recently proposed as an effective method for the reconstruction of continuous functions, starting from a not evenly sampled set of points. This approach allows a convenient representation of the reconstructing function through neural networks. As far as gaussian radial basis functions are concerned, the parameters that completely define the reconstructing function are the coordinates of the center and the variance of each gaussian, and their number. We propose here a method to automatically determine these parameters starting from the distances between the sampled data. Preliminary results are reported for small sets of bidimensional points.


INTRODUCTION 

The reconstruction of a continuous curve g(x) starting from a set of points by means of a parametric function f(x) = F(x,w), with w a set of unknown parameters, is an ill-posed problem that allows an infinite number of solutions, all compatible with the data. To obtain the optimal solution to the problem, a classical approach is to introduce soft constraints that do not specify exact desired values of the function F(x,w) but only a tendency of it in the definition domain. One of the most promising classes of reconstructing functions has been recently proposed by Poggio and Girosi (1989); it allows the function F(x,w) to be written as a linear combination of radial (gaussian) functions:

$$F(x,w) = \sum_{i=1}^{N} c_i\, G(x; d_i, \sigma_i) \tag{1}$$

where $c_i$, $d_i$ and $\sigma_i$ are the parameters w to be determined, respectively: the amplitude coefficient, the center and the variance of the gaussian $G_i$; N is the number of different gaussian functions employed; N ≤ M, where M is the number of the sampled points $P_j = (x_j; y_j)$.

In this paper we propose a method for the automatic setting of these parameters as a function of the distances between the points, to achieve the optimal reconstructing function. The behaviour of the method has been tested by comparing the different curves obtained from a set of randomly generated points.

This algorithm can be easily extended to multi-dimensional functions and it is well suited for a hardware neural network implementation.


METHOD 

First, the optimal number of gaussian functions is determined; then the optimal values of the coefficient parameters are computed.

The coefficients $c_i$ can be determined analytically by imposing that F(x,w) passes through the M sampled points:

$$\sum_{i=1}^{N} c_i\, G(x_j; d_i, \sigma_i) = y_j \quad (j=1,\ldots,M) \tag{2}$$

in matrix notation: $Y = G\,C$, where $[Y]_j = y_j$.

If the sampled points are close enough to one another and the number of the gaussians is equal to the number of the sampled points (N = M), the matrix G is close to singular and the obtained solution for the coefficients $c_i$ is heavily affected by computational errors. It is therefore mandatory to reduce the number of gaussians. Moreover, the use of a number of gaussian basis functions smaller than the number of sampled points prevents overfitting the data: F(x,w) will not follow the ripples caused by the noise on the points, and a better approximation of the original signal is produced, with a greater ability to generalize to similar signals. In Figure 1 the reconstructing curve is shown for a set of M points (N=M) and different values of the variance σ. Notice the high value of the coefficients and the high oscillations in the curve, above all when σ assumes a low value.


Figure 1: Three different reconstructing functions F(x,w) with different values of the variance σ. The solid line is the approximating curve with σ = 0.1; relative coefficients are: {3.34; -6.06; 3.19; 0.67; -11.32; 18.12; -12.77; -12.72; 21.36; -5.63}. The dashed line is the approximating curve with σ = 0.05; relative coefficients are: {-7.01; 10.55; -6.30; 2.62; -1.82; 1.83; -1.54; -1.54; 2.91; -1.07}. The dotted line is the curve obtained with σ = 0.0006; relative coefficients are: {-0.014; 0.021; 0.002; 0.004; -0.057; -0.055; -37.167; 37.130; 0.041; -0.053}. Point coordinates are: {(-1.00, -0.23); (-0.74, 0.04); (0.51, 0.66); (-0.08, -0.93); (0.06, -0.89); (-0.56, 0.06); (-0.91, 0.34); (0.04, -0.98); (0.36, -0.23); (0.09, -0.87)}.


For these reasons it is numerically efficient to reduce the number of gaussian basis functions adopted and therefore the number of coefficients $c_i$. The problem can be summarized as: How many and which gaussians should be eliminated, and how?

We propose here an iterative procedure to determine the number and the values of the centers. Initially, the abscissas of the centers are set coincident with the abscissas of the sampled points. At every iteration step one gaussian is eliminated as follows: the pair of gaussians whose centers are the nearest ones is determined and a gaussian with the abscissa equal to the mean value of the abscissas of the two centers is substituted for them. This procedure ends when all the centers are separated by a distance greater than a predetermined threshold, related to the desired degree of smoothness for the curve.
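
The iterative merging step can be sketched as follows. This is a minimal one-dimensional illustration assuming NumPy; the function name `merge_centers` and the threshold value used in the example are hypothetical, since the text does not state the threshold that was actually used.

```python
import numpy as np

def merge_centers(centers, threshold):
    """Repeatedly replace the two nearest centers by their mean abscissa
    until every pair of centers is more than `threshold` apart."""
    c = np.sort(np.asarray(centers, dtype=float))
    while c.size > 1:
        gaps = np.diff(c)             # in 1-D the nearest pair is adjacent once sorted
        i = int(np.argmin(gaps))
        if gaps[i] > threshold:
            break                     # all centers are sufficiently separated
        merged = 0.5 * (c[i] + c[i + 1])
        c = np.concatenate([c[:i], [merged], c[i + 2:]])
    return c

# Abscissas of the ten sampled points listed for Figure 1.
x = [-1.00, -0.74, 0.51, -0.08, 0.06, -0.56, -0.91, 0.04, 0.36, 0.09]
print(merge_centers(x, threshold=0.05))
```

Sorting keeps the search for the nearest pair linear per iteration; the merged center always lies between its two parents, so the array stays sorted.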

Alternatively, an analytical solution can be carried out. The optimal number of gaussians is determined using the properties of the singular value decomposition. This analytical technique decomposes the matrix G (NxM) into the product of three matrices: $G = UWV^T$, where U and V are orthonormal (NxM and MxM) and W is a diagonal matrix (MxM) which contains the singular values [Golub and Van Loan (1989)]. The number of effective centers is the rank of G, which is equal to the number of singular values in W that are significantly different from zero. For the same points reported in Figure 1, a significant reduction in the value of the coefficients is obtained by reducing the number of centers from 10 to 8. Moreover, some ripples, which can be easily attributed to noise, are filtered out, as can be seen in Figure 2.
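
Counting the significant singular values can be sketched as below, assuming NumPy. Two things here are assumptions rather than details from the text: the gaussian is taken as $\exp(-(x-d)^2/(2\sigma^2))$, and "significantly different from zero" is implemented as a relative tolerance `rel_tol` against the largest singular value. The helper names are illustrative.

```python
import numpy as np

def gaussian_design(x, centers, sigma):
    """Matrix of gaussian responses: one row per sample, one column per center.
    The exact gaussian form is an assumption."""
    d = np.asarray(x, dtype=float)[:, None] - np.asarray(centers, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def effective_rank(G, rel_tol=1e-3):
    """Number of singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(G, compute_uv=False)   # returned in descending order
    return int(np.sum(s > rel_tol * s[0])), s

x = np.array([-1.00, -0.74, 0.51, -0.08, 0.06, -0.56, -0.91, 0.04, 0.36, 0.09])
rank, singular_values = effective_rank(gaussian_design(x, x, sigma=0.1))
print(rank, singular_values)
```

The count depends on the chosen tolerance, so it is worth inspecting the printed singular values before fixing a cutoff.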

Once the number of gaussians, N, has been determined, the optimal value of the coefficients can be determined as [Poggio and Girosi (1989)]:

$$C = (G^TG + \lambda g)^{-1}G^TY \tag{3}$$

where C, Y and G are the vectors and matrix defined in equation (2) and g is the matrix defined as follows: $[g]_{ij} = G(x_i; x_j)$. The parameter λ regulates the degree of smoothness of the reconstructing function. The smaller the value of λ, the nearer to the points the reconstructing function F(x,w) will be; it will also undergo undesirable oscillations that yield a poor generalization capability. The bigger the value of λ, the smoother the function. In this case the frequency content of the reconstructed signal will be reduced [Oppenheim and Schafer (1975)], operating a low-pass filtering that will eventually filter out possible ripples.
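
Equation (3) amounts to one linear solve. The sketch below assumes NumPy, the gaussian form $\exp(-(x-d)^2/(2\sigma^2))$, and that g is evaluated between the N centers so that it has the same N×N size as $G^TG$; these choices and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian(x, d, sigma):
    # Assumed gaussian form; the text does not spell it out.
    return np.exp(-(x - d)**2 / (2.0 * sigma**2))

def fit_coefficients(x, y, centers, sigma, lam):
    """Solve (G^T G + lam * g) C = G^T Y for the amplitude coefficients."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    G = gaussian(x[:, None], centers[None, :], sigma)         # samples x centers
    g = gaussian(centers[:, None], centers[None, :], sigma)   # centers x centers
    return np.linalg.solve(G.T @ G + lam * g, G.T @ np.asarray(y, dtype=float))

x = np.array([-1.0, 0.0, 1.0])
y = np.array([1.0, 2.0, 3.0])
for lam in (0.0, 0.1):
    c = fit_coefficients(x, y, x, sigma=0.5, lam=lam)
    residual = gaussian(x[:, None], x[None, :], 0.5) @ c - y
    print(lam, np.linalg.norm(residual))    # residual grows as lam increases
```

With λ = 0 and as many well-separated centers as points the curve interpolates the samples; a positive λ trades that exactness for smoothness.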

The parameter λ is therefore global over the entire definition domain of F(x,w). Alternatively, we can play with the value of the variance of the gaussians that is used in the reconstructing function. It is also related to the degree of smoothness of the reconstructing curve; as can be noted from Figure 1, the curve becomes smoother as the variance of the gaussians increases.

The variance can be automatically computed following the heuristic of global first nearest-neighbor proposed by Moody and Darken (1989). The variance is set to the mean of the minimal distances between each center and its nearest point $P_j$.
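
As described, the heuristic reduces to a nearest-distance average. The following is a small sketch assuming NumPy and distances measured along the abscissa only (the text does not say whether the ordinate is included); the function name is illustrative.

```python
import numpy as np

def global_first_nearest_neighbor(centers, points):
    """Mean, over all centers, of the distance to the nearest sampled abscissa."""
    centers = np.asarray(centers, dtype=float)
    points = np.asarray(points, dtype=float)
    dist = np.abs(centers[:, None] - points[None, :])   # center-to-point distances
    return float(np.mean(dist.min(axis=1)))

print(global_first_nearest_neighbor([0.0, 1.0], [0.2, 0.9, 3.0]))
```

A single global value is returned, so every gaussian receives the same width under this heuristic.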

The sum of the mean square distance between F(x,w) and the sampled points can give an idea of the performance of the reconstructing function. In Table 1, this distance is reported as a function of both λ and the variance of the gaussians; it increases with the increase of both λ and σ but the curve becomes evidently smoother.

Table 1: The data of this table refer to the function F(x,w) constituted of 8 gaussians used to reconstruct the curve through a set of 10 points as shown in Figure 2. Mean square distance as a function of the smoothness parameter λ and of the variance of the gaussians is reported. The variance computed using global first nearest-neighbor is highlighted in italics.


We may get a better approximation of the function considering that the frequency property of the signal to be reconstructed may not be equal over all the definition domain. This can be obtained by choosing gaussians of different variances; each variance $\sigma_i$ will be a function of the distances between the sampled points and it is set equal to the mean distance of its center from those points that fall into a certain region around its center; this region has the function of a receptive field for each gaussian. The reconstructing curve obtained with gaussians of different variances is plotted in Figure 2 and the mean square distance is reported in Table 2. It should be remarked that the advantage of this procedure becomes apparent for large sets of data.

Variance: 0.2023  0.2775  0.2773  0.2948  0.359  0.3582  0.1775  0.1752

Table 2: The data of this table refer to the function F(x,w) constituted of 8 gaussians used to reconstruct the curve through a set of 10 points as shown in Figure 2. Mean square error as a function of the variances of the gaussians is reported. The smoothness parameter λ is set to 0. The amplitude of the interval to determine the variance is set to ±0.5.


CONCLUSION 

Although λ can, alone, regulate the degree of smoothness of the reconstructing function, it can be used only when analytical solutions are feasible. The reconstruction of functions from large sets of data (surfaces, multi-dimensional temporal sequences) requires the use of numerical solutions. Gaussian basis functions allow the definition domain to be naturally partitioned into regions (receptive fields) that ease numerical solutions. The parameters to be tuned are affected only by the behaviour of the data belonging to a small sub-region of the definition domain. Taking advantage of this property, our method can achieve an optimal reconstruction of functions of large data sets.

Moreover, this approach is particularly suited to be implemented with neural networks: it allows the network topology to be defined and its parameters to be tuned locally for a huge amount of data.


Figure 2: Approximation of a curve starting from the same set of points as in Figure 1, through a reduced set of gaussians. The sampled points are represented with o. The solid line represents the approximation function obtained with 8 gaussians centered in x: {-0.74; 0.51; -0.08; 0.07; -0.56; 0.87; 0.36; -0.95} with relative coefficients {-2.06; 22.78; -13.59; 20.84; 3.36; -5.88; -27.62; 0.26} and variance σ = 0.1 equal for all gaussians. The dashed line represents the approximation function obtained with 3 gaussians centered in x: {-0.01; -0.8; 0.65} with relative coefficients {-0.70; 0.11; -0.03} and variance σ = 0.1 equal for all gaussians. The dotted line represents the approximation function obtained with 8 gaussians centered in x, with relative coefficients: {6.2530; 0.8655; -2.3784; -3.8514; -0.9216; -2.3840; 4.0225; -3.6141} and variance reported in Table 2.

Abstract-Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. The recognition process of an OCR system is a challenging problem and is made difficult by added noise, image distortion, and the various character typefaces, sizes, and fonts that a document may have. In this study a neural network approach is introduced to perform high accuracy recognition on multi-size and multi-font characters; a novel centroid-dithering training process with a low noise-sensitivity normalization procedure is used to achieve high accuracy results. The study consists of two parts. The first part focuses on single size and single font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font. When tested on a database of 1,072,452 characters, this neural network has zero recognition errors. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts. No errors were incurred while testing this network on a database of 347,712 characters of 12 fonts and four different point sizes. When tested on the database of 1,072,452 Courier 12 point characters, this neural network had one recognition error.


I.  Introduction 

In today's world of information, countless forms, reports, contracts, and letters are generated each day; hence, the need to archive, retrieve, update, replicate, and distribute printed documents has become increasingly important [1,2]. An available technology that automates these tasks on computer media is optical character recognition (OCR); printed documents are transformed into ASCII files, which enable compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. An overview of the OCR process is illustrated below:

Figure 1: The Optical Character Recognition Process

An essential requirement for OCR lies in the development of an accurate recognition algorithm by which digitized images are analyzed and classified into corresponding characters. Published literature reports error rates on the order of one percent for single-font recognition and higher error rates for multi-fonts [3,4]. While an error rate on the order of one percent may appear impressive, it would generate 30 errors on an average page containing 3000 characters. Such error rates limit the usefulness in many applications and illustrate the need for a more accurate recognition algorithm.

The study proposes using neural networks to perform high accuracy character recognition and consists of two parts. The first part focuses on single size and single font characters, and a two-layered neural network is trained to recognize the full set of 94 ASCII character images in 12-pt Courier font¹. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all point sizes from 8 to 32 and for 12 commonly used fonts.

II. Neural Network Implementation

The neural network used to recognize single-size and single-font character images has 3000 inputs, 20 neurons in the first layer, and 94 neurons in the second or output layer. The neural network used to recognize multi-size and multi-font character images has 2500 inputs, 100 neurons in the first layer, and 94 neurons in the output layer. Both networks are fully connected and feedforward with the sigmoidal function generating the nonlinearity. The training algorithm used in this study is the backpropagation algorithm [6,7].

¹Courier font is important because it is the font most often used in legal documents. The technique used in the development of a neural network for Courier font is general and can be applied to any other single font.

A. Database

The first neural network deals with single-size and single-font character recognition, and the training and testing data is of 12 point Courier font. The training data is comprised of 94 digitized character images; there is a one-to-one correspondence between each training datum and each member of the 12 point Courier font character set. The neural network is thoroughly evaluated with testing data comprising 1,072,452 character images from a library of English short stories.

The  second  neural  network  deals  with  multi-size  and  multi-font  character 
recognition.  The  allowable  point  size  ranges  from  8  to  32,  and  the  fonts  include  Arial, 
AvantGarde,  Courier,  Helvetica,  Lucida  Bright,  Lucida  Fax,  Lucida  Sans,  Lucida  Sans 
Typewriter,  New  Century  Schoolbook,  Palatino,  Times,  and  Times  New  Roman.  The 
training  data  is  comprised  of  1,128  (or  94  x  12)  character  images;  each  member  of  the 
complete  character  set  for  each  font  appears  exactly  once  in  the  training  set.  It  is 
important  to  note  that  all  training  character  images  are  of  16  point  size,  even  though  the 
network  is  trained  to  perform  recognition  on  multi-size  characters.  This  is  explained  in  the 
next  section.  The  testing  data  consists  of  347,712  characters  or  28,976  characters  for 
each  font  and  has  an  even  mixture  of  8, 12,  16,  and  32  point  sizes. 


B.  Data  Preprocessing  and  Normalization 

Before  it  is  fed  to  the  neural  network,  the  digitized  image  is  preprocessed  and  normalized. 
The  preprocessing  and  normalization  procedure  serves  several  purposes;  it  reduces  noise 
sensitivity  and  makes  the  system  invariant  to  point  size,  image  contrast,  and  position 
displacement. 

The reduction of noise sensitivity is achieved by thresholding. Thresholding removes the low-level background noise, which is caused by inherent paper nonuniformity, specks, and other paper defects. The input image is filtered by zeroing those pixels whose values are less than 20% of the peak pixel value, while the remaining pixels are unchanged. The threshold setting is heuristic and has been empirically shown to work well for white paper. The threshold setting should be adjusted accordingly when a different paper product is used, e.g. newspaper.
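
The thresholding rule can be sketched as follows, assuming NumPy and an image array in which ink corresponds to large pixel values; the function name and the `fraction` parameter are illustrative.

```python
import numpy as np

def threshold_image(image, fraction=0.2):
    """Zero every pixel below `fraction` of the peak value; keep the rest unchanged."""
    img = np.asarray(image, dtype=float)
    out = img.copy()
    out[img < fraction * img.max()] = 0.0
    return out

noisy = np.array([[0.10, 0.50],
                  [1.00, 0.19]])
print(threshold_image(noisy))   # the 0.10 and 0.19 entries are zeroed
```

Exposing the fraction as a parameter makes it straightforward to retune the setting for a different paper stock.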

Following thresholding, the resultant image is centered by positioning the centroid of the image at the center of a fixed size frame. The centroid $(\bar{x},\bar{y})$ of an image is defined as follows:

$$\bar{x} = \frac{\sum_x \sum_y x\,\mathrm{pixel}(x,y)}{\sum_x \sum_y \mathrm{pixel}(x,y)} \tag{1}$$

$$\bar{y} = \frac{\sum_x \sum_y y\,\mathrm{pixel}(x,y)}{\sum_x \sum_y \mathrm{pixel}(x,y)} \tag{2}$$
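
The centroid computation and the centering step might look like the sketch below, assuming NumPy, with x taken as the column index and y as the row index, and with the shift rounded to whole pixels. These conventions and the function names are assumptions for illustration.

```python
import numpy as np

def centroid(image):
    """Intensity-weighted centroid (x_bar, y_bar); x is the column, y the row."""
    img = np.asarray(image, dtype=float)
    ys, xs = np.indices(img.shape)
    total = img.sum()
    return (xs * img).sum() / total, (ys * img).sum() / total

def center_in_frame(image, frame_shape):
    """Copy the image into a zero frame so its centroid lands on the frame center."""
    img = np.asarray(image, dtype=float)
    cx, cy = centroid(img)
    fh, fw = frame_shape
    dx = int(round((fw - 1) / 2.0 - cx))      # whole-pixel shift (assumption)
    dy = int(round((fh - 1) / 2.0 - cy))
    frame = np.zeros(frame_shape)
    rows, cols = np.nonzero(img)
    rr, cc = rows + dy, cols + dx
    keep = (rr >= 0) & (rr < fh) & (cc >= 0) & (cc < fw)   # drop pixels leaving the frame
    frame[rr[keep], cc[keep]] = img[rows[keep], cols[keep]]
    return frame

glyph = np.zeros((3, 3))
glyph[0, 2] = 1.0
print(centroid(glyph))
print(center_in_frame(glyph, (5, 5)))
```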


For the 12 point Courier font case, a frame size of 50-by-60 pixels has been found to be adequate in enclosing all character images. For the multi-size and multi-font case, a frame size of 50-by-50 pixels is employed, and an additional scaling process must be applied to the images. The scaling entails initially computing the radial moment $M_r$.

$$M_r = [\text{expression illegible in source}] \tag{3}$$


Next, the image is enlarged or reduced with a gain factor of $10/M_r$, producing images of
constant radial moments. The value of this constant radial moment is linked to the selected
frame size of 50-by-50. From a broader perspective, this scaling process is equivalent to a
point-size normalization procedure and enables a neural network to treat all character
images the same way regardless of the point size. An illustration of the thresholding,
scaling, and centering operations is shown below:

Figure 2: Preprocessing. (b) Multi-Size and Multi-Font Preprocessing

The  next  step  in  the  preprocessing  and  normalization  procedure  is  to  convert  the 
two-dimensional  images  into  vectors.  The  conversion  is  achieved  by  concatenating  the 
rows  of  the  two-dimensional  pixel  array.  It  follows  that  the  vector  for  the  single-size  and 
single-font  case  has  3,000  elements  and  that  the  vector  for  the  multi-size  and  multi-font 
case  has  2,500  elements.  Additionally,  each  vector  is  normalized  to  unit  power; 


The  normalization  reduces  sensitivity  to  varying  scanner  gains  (image-to-background 
contrast)  as  well  as  different  toner  darkness  (shades  of  ink).  This  unit-norm  vector  is  then 
fed  into  the  neural  network. 
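
The vectorization and normalization steps can be sketched as follows. This assumes that "unit power" means unit Euclidean norm, as the phrase "unit-norm vector" suggests; the function name `image_to_unit_vector` is hypothetical.

```python
import math


def image_to_unit_vector(image):
    """Concatenate the rows of a 2-D pixel array and scale the result to unit norm."""
    vector = [float(p) for row in image for p in row]
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return vector                      # blank image: nothing to normalize
    return [v / norm for v in vector]
```

Because the vector is divided by its own norm, multiplying every pixel by a constant scanner gain leaves the result unchanged, which is the contrast invariance described above.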

During training, there is an additional step performed on the input data: centroid
dithering. The centroid dithering process applies to both the single-size and single-font
case and the multi-size and multi-font case. The process involves dithering the
centroid of the two-dimensional input image. After centering and scaling, the input image
is displaced randomly and independently in both the horizontal and vertical directions over
the range of [-2,+2] pixels in each dimension; that is, the image is shifted at random to one of
twenty-five possible displacement positions. The resultant image is then converted into a
vector, normalized, and fed into the network as previously described.
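
A minimal sketch of centroid dithering, assuming a rectangular image stored as a list of rows and integer shifts drawn uniformly from [-2, +2] in each direction. The function name `dither_centroid` and its parameters are hypothetical; pixels shifted outside the frame are simply dropped in this sketch.

```python
import random


def dither_centroid(image, max_shift=2, rng=random):
    """Shift the image by independent random integer offsets in x and y."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    height, width = len(image), len(image[0])
    shifted = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ty, tx = y + dy, x + dx
            if 0 <= ty < height and 0 <= tx < width:
                shifted[ty][tx] = image[y][x]
    return shifted, (dx, dy)
```

With `max_shift=2` there are 5 x 5 = 25 possible displacements, matching the count given in the text.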

Centroid  dithering  effectively  creates  many  "different"  images  from  a  single  image. 
The  neural  network  is  exposed  to  the  same  character  at  different  displacement  positions, 
making  the  recognition  system  invariant  to  input  displacements.  It  is  important  to 
emphasize  that  the  dithering  is  performed  exclusively  during  training  and  not  during 
testing.  There  are  several  other  added  advantages  of  using  this  technique.  For  example,  the 
approach  does  not  increase  the  number  of  training  data,  and  the  amount  of  training  data 
can  be  kept  at  a  minimum.  The  approach  also  enables  the  network  to  tolerate  width 
variations in character strokes which might be caused by different printer settings, toner
levels,  and  variations  in  font  implementation.  This  is  particularly  useful  when  bold  face 
characters  are  encountered. 

C.  Training  and  Testing 




Figure  3:  Learning  Curve  of  the  Single-Size  and  Single-Font  Neural  Network 

The  training  is  performed  using  the  backpropagation  algorithm  with  an  initial 
learning-rate  parameter  of  p=10  for  both  the  single-size  and  single-font  neural  network 
and  the  multi-size  and  multi-font  neural  network.  The  learning  progress  is  monitored  by 
computing the mean squared error (m.s.e.) for each output neuron:

where C is the cost function defined by:

Here, $d_i$ and $y_i$ designate the desired and actual outputs of the $i$-th neuron in the last layer
in response to the $p$-th input pattern. N is the number of neurons in the output layer
(N=94).

The m.s.e. at the onset of training is predictable. Since the weights are randomly initialized
over the interval [ -10*1® ^ +10-1® ], they may all be approximated as zeros. Hence, all
neurons assume the value of Sgm(0) = 0.5 in the initial m.s.e. calculations, irrespective of
the input pattern. With a desired value of either 1 or 0, each output neuron generates an
error of ±0.5, and the resultant mean squared error is 0.25.
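
The 0.25 figure can be checked numerically. The sketch below assumes a logistic sigmoid, a one-hot target over the 94 output neurons, and an m.s.e. that averages the squared error over the output neurons; the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def initial_mse(num_outputs=94):
    """m.s.e. when every output neuron sits at sigmoid(0) = 0.5."""
    output = sigmoid(0.0)
    targets = [1.0] + [0.0] * (num_outputs - 1)   # one-hot desired response (assumption)
    return sum((d - output) ** 2 for d in targets) / num_outputs
```

Every error is +0.5 or -0.5, so every squared error is 0.25 and so is their average, whatever the number of outputs.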

The  m.s.e.  values  as  a  function  of  training  iterations  are  plotted  and  shown  in 
Figure  3  for  the  single-size  and  single-font  neural  network.  The  single-size  and  single-font 
neural network was trained for 430,000 iterations, and the final m.s.e. is approximately
10*^. The multi-size and multi-font neural network was trained for 8,650,000 iterations,
and the final m.s.e. is approximately 2 * 10*®.


D.  Postprocessing 

Postprocessing refers to a simple procedure by which the output of the neural network is
analyzed and modified. An important task of postprocessing pertains to the detection of
invalid character inputs. More specifically, the detection is accomplished by observing the
occurrence of small responses on all output neurons. This is an intrinsic property of a
trained neural network and is very useful in discounting bad images which might result
from segmentation errors or other defects.

The second function of postprocessing involves recovering information lost in
scaling and centering multi-size and multi-font character images and is used for the
multi-size and multi-font system only. The characters c, C, k, K, o, O, p, P, s, S, v, V, w, W, x,
X, z, and Z of certain fonts lose their case information after scaling and are therefore
recognized by the neural network without an affirmative upper/lower case identification.
This case information, however, can easily be reconstructed by a context-based approach.
The technique resorts to examining the radial moments of the original character images
prior to scaling and is best explained by an example. Without loss of generality, it is
assumed that the neural network identifies an image as the character "c." The first step is
to deduce the point size of this "c" by computing the gain $10/M_r^*$, where $M_r^*$ is the
radial moment of a neighboring character that is case distinguishable. The next step is to
calculate the radial moments of a fabricated upper case "C" and a fabricated lower case "c"
of this point size. The case information is then obtained by comparing the radial moment
of the input character "c" with those of the fabricated ones.

Commas and single quotes of certain fonts also become indistinguishable after
centering and scaling. The discrimination between these two characters is made by
comparing the centroid location of the input character image before preprocessing to the
height of the line. Finally, the numeral zero cannot be reliably distinguished from the letter
"O" in some fonts, and similarly the numeral "1", lower case "L", upper case "I", and
vertical bar "|" are ambiguous for some fonts. Under these circumstances the characters
are left as they are without any postprocessing.

III.  Recognition  Performance 

A.  Results 

In  order  to  determine  the  recognition  performance,  a  computer  program  is  used  to 
compare  the  output  of  the  recognition  system  with  the  ASCII  files  which  were  used  to 
generate  the  testing  data.  The  computer  program  examines  each  and  every  output 
character, and all discrepancies excluding spaces, tabs, and carriage returns are recorded.
The  discrepancies  are  individually  examined  and  classified  as  either  an  "erroneous"  or  a 
"correct"  recognition. 

There  are  two  situations  where  discrepancies  are  classified  as  correct  recognition. 
The  first  case  involves  image  corruption  which  renders  the  invalid  images  unrecognizable. 
Figure  4  provides  examples  of  invalid  inputs  due  to  segmentation  error,  scanning  error, 
and  paper  residue.  As  explained  in  Section  II  Part  D,  the  neural  network  automatically 
generates  small  responses  on  all  the  output  neurons  to  indicate  "bad"  inputs.  Such 
occurrence  is  detected  during  postprocessing,  and  the  discrepancy  is  not  counted  as  a 
recognition  error.  The  second  case  involves  characters  that  are  indistinguishable.  These 
include the numeral "0", letter "O", lower case "L", upper case "I", numeral "1", and the
vertical bar "|" in some fonts. The ambiguity arises from the fact that one character from
one font looks identical to a different character of another font. Hence, the neural
network may output any of the ambiguous characters when the input is ambiguous, and it
is  not  counted  as  an  error.  Any  other  discrepancies  which  do  not  fall  into  one  of  these  two 
categories  are  counted  as  recognition  errors. 

The  character-ambiguity  problem  does  not  apply  to  the  single-size  and  single-font 
experiment,  since  all  characters  of  the  Courier  font  are  distinguishable.  All  discrepancies 
except  those  of  corrupted  images  are  treated  as  errors.  The  neural  network  is  required  to 
recognize all 94 characters including the difficult distinction between the lower case L
("l") and the numeral one ("1"). The single-size and single-font neural network is tested
with  1,072,452  characters  of  12  point  Courier  font,  and  a  perfect  recognition  accuracy  has 
been  achieved.  This  recognition  performance  exceeds  any  previously  known  results  by  at 
least  an  order  of  magnitude. 

Figure 4. Neural Responses to Invalid Images as a Result of
(a)  Segmentation  Error,  (b)  Scanning  Error,  and  (c)  Paper  Residue 

The multi-size and multi-font neural network was tested with 347,712 characters,
or 28,976 characters for each of the following fonts: Arial, AvantGarde, Courier,
Helvetica, Lucida Bright, Lucida Fax, Lucida Sans, Lucida Sans Typewriter, New Century
Schoolbook, Palatino, Times, and Times New Roman. The testing data consists of an even
mixture of 8, 12, 16, and 32 point sizes. Using the performance criteria as previously
described, the multi-size and multi-font neural network has achieved a perfect recognition
accuracy. The same network was also tested with the data used for the single-size and
single-font neural network. There was one recognition error among the 1,072,452 testing
characters of 12 point Courier font. The error is documented in Figure 5.


Figure 5. Multi-Size and Multi-Font Recognition Error (panels: INPUT, OUTPUT)


B.  Analysis 

The question arises: if n independent trials of an experiment have resulted in success,
what is the probability that the next trial will result in success? In this context, we employ
Laplace's Special Rule of Succession [13], which yields an estimate of the probability of
success

$$p = \frac{n+1}{n+2}.$$

For the single-size and single-font neural network, we obtain

$$p = \frac{1{,}072{,}453}{1{,}072{,}454} = 99.99991\%,$$

and for the multi-size and multi-font neural network,

$$p = \frac{347{,}713}{347{,}714} = 99.99971\%.$$
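
The two estimates follow directly from the rule of succession and can be reproduced with a short calculation. The function name `rule_of_succession` is hypothetical; the trial counts are the test-set sizes quoted in the text.

```python
def rule_of_succession(successes):
    """Laplace's estimate of the next-trial success probability after n straight successes."""
    return (successes + 1) / (successes + 2)


single_font = rule_of_succession(1_072_452)   # single-size and single-font test set
multi_font = rule_of_succession(347_712)      # multi-size and multi-font test set
print(f"{100 * single_font:.5f}%  {100 * multi_font:.5f}%")
```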

Alternately, we introduce the following statistical analysis in order to quantify a
lower bound for the recognition accuracy on future testing data. Given a testing image
corresponding to the $p$-th character where $p \in \{1, 2, \ldots, 94\}$, we define two random
variables,

$$A_p = y_p \qquad (7)$$

$$B_p = \max_{i \neq p} y_i \qquad (8)$$

where $y_p$ is the output of the $p$-th neuron of the output layer. The correct recognition of a
character requires that $A_p > B_p$. The conditional probability of error given an input image
of the $p$-th character is derived below.

$$\mathrm{Prob}(\mathrm{error} \mid p) = \mathrm{Prob}(A_p < B_p \mid p) \qquad (9)$$

$$\le \mathrm{Prob}\big(\,|A_p - B_p - E(A_p - B_p)| \ge |E(A_p - B_p)| \;\big|\; p\,\big) \qquad (10)$$

$$\le \frac{\mathrm{Var}(A_p - B_p)}{[E(A_p - B_p)]^2} \qquad (11)$$

$$\approx \mathrm{Var}(A_p - B_p) \qquad (12)$$

Inequality (11) invokes the Chebyshev inequality, and (12) approximates $A_p$ as
1 and $B_p$ as 0. These assumptions are verified by the sample averages obtained from
testing data.
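
A minimal sketch of the bound in (11), computed from sample values of the margin A_p - B_p. The function names are hypothetical, and the use of the population (divide-by-n) variance is an assumption; the paper does not say which estimator was used.

```python
def chebyshev_bound_from_moments(mean_margin, var_margin):
    """Upper bound on Prob(error | p) from E(A_p - B_p) and Var(A_p - B_p)."""
    return var_margin / mean_margin ** 2


def chebyshev_bound_from_samples(margins):
    """Same bound, with the mean and variance estimated from observed margins."""
    n = len(margins)
    mean = sum(margins) / n
    var = sum((m - mean) ** 2 for m in margins) / n
    return chebyshev_bound_from_moments(mean, var)


# With a mean margin of 0.99 the bound is within about 2% of the variance itself,
# which is the approximation made in (12).
bound = chebyshev_bound_from_moments(0.99, 5.8e-5)
```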

To illustrate the concept, we apply the Chebyshev lower bound to the four most
frequent characters in our data: "e", "t", "a", and "o". These letters are all in lower case
except the letter "o". Since it is an ambiguous character, the samples could also be an
upper case or a zero. The following tables summarize the sample averages and variances
computed during the test run and the resultant bound on the probability of error.


Character   Samples   E(Ap-Bp)   Var(Ap-Bp) (≈ Chebyshev upper bound)
"e"         113,060   0.99       [illegible]
"t"         80,273    0.99       5.8 × 10^-5
"a"         72,565    0.99       2.9 × 10^-5
"o"         70,423    0.98       5.6 × 10^-5

Table 1. Chebyshev Upper Bound for the Probability of Error for the Single-Size
and Single-Font Neural Network


Character   Samples       E(Ap-Bp)   Var(Ap-Bp) (≈ Chebyshev upper bound)
"e"         38,784        0.99       3.9 × 10^-5
"t"         [illegible]   0.99       [illegible]
"a"         24,528        0.99       3.2 × 10^[illegible]
"o"         26,112        0.99       9.2 × 10^-5

Table 2. Chebyshev Upper Bound for the Probability of Error for the Multi-Size
and Multi-Font Neural Network

The  Chebyshev  bound  is  known  in  general  to  be  a  conservative  bound,  and  the  upper 
bounds  on  the  probabilities  of  error  in  Tables  1  and  2  are  much  higher  than  those 
estimated  by  the  Laplace  Rule  of  Succession. 


IV. Conclusions

The study presents a neural network scheme with centroid dithering and a low
noise-sensitivity normalization procedure for high accuracy optical character recognition. The
single-size and single-font neural network has been successfully trained to recognize 12
point Courier font characters. The neural network was trained with a database of 94
character images. The neural network was tested on a database of 1,072,452 character
images and achieved perfect recognition. Based on the experience of this network, a
larger neural network was successfully designed and trained to recognize characters of 12
commonly used fonts and point sizes from 8 to 32. The latter neural network was trained
with 1,128 character images, and it achieved perfect recognition on a testing database of
347,712 multi-size and multi-font characters. To gauge the tradeoff between the two
networks, the multi-size and multi-font neural network was tested on the 1,072,452
Courier character database, and one error was incurred.

Two-Layer Linear Structures for Fast Adaptive Filtering
Françoise Beaufays, Bernard Widrow*

Abstract 

The  LMS  algorithm  invented  by  Widrow  and  Hoff  in  1959  is  the  simplest,  most  robust,  and 
one  of  the  most  widely  used  algorithms  for  adaptive  filtering.  Unfortunately,  it  suffers  from  high 
sensitivity  to  the  conditioning  of  its  input  autocorrelation  matrix:  the  higher  the  input  eigenvalue 
spread,  the  slower  the  convergence  of  the  adaptive  weights. 

This  problem  can  be  overcome  by  preprocessing  the  inputs  to  the  LMS  filter  with  a  fixed 
data-independent  transformation  that,  at  least  partially,  decorrelates  the  inputs.  Typically,  the 
preprocessing  consists  of  a  DFT  or  a  DCT  transformation  followed  by  a  power  normalization  stage. 
The  resulting  algorithms  are  called  DFT-LMS  and  DCT-LMS.  A  fast  and  robust  implementation 
of  the  DFT  or  the  DCT  preprocessing  stage  is  itself  obtained  by  using  an  adaptive  filter  based 
on  the  LMS  algorithm.  The  overall  structure  is  thus  a  fully  adaptive  two-layer  linear  filter,  which 
achieves  better  speed  performance  than  pure  LMS  while  retaining  its  low  computational  cost  and 
its  extreme  robustness. 


Figure  1:  The  DFT-LMS  and  DCT-LMS  algorithms:  block  diagram. 

The DFT-LMS and DCT-LMS algorithms are represented in Fig. 1. The signal $x_k$ is passed
through a tap-delay line whose outputs are transformed by a DFT or a DCT. This transformation
splits the signal into different frequency components that are approximately uncorrelated. The
outputs of the DFT/DCT, $u_k(i)$, are then fed to the adaptive filter, whose weights are adjusted using
the power normalized LMS algorithm (i.e. a version of the LMS algorithm where each weight has
a learning rate that is inversely proportional to the estimated power of its input). The DFT/DCT

*The authors are with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305-4055.
This research was sponsored by EPRI under contract RP8010-13, by NSF under grant NSF IRI 91-12531, and by
ONR under contract N00014-92-J-1787.
preprocessing along with the power normalization tend to make the equivalent input autocorrelation
matrix  close  to  identity  and,  consequently,  to  improve  the  convergence  speed  of  the  LMS  filter 
weights.  This  approach  is  to  be  contrasted  with  recursive  least  squares  algorithms  where  the  inputs 
are  whitened  by  an  estimate  of  their  inverse  autocorrelation  matrix. 
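
The structure just described can be sketched for the DCT case in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the transform is applied directly as a matrix product (no sliding or adaptive computation), the input power of each branch is estimated by exponential averaging, and the names `dct_matrix`, `dct_lms`, `mu`, `beta`, and `eps` are all hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT matrix; row p holds the p-th cosine basis vector."""
    rows = []
    for p in range(n):
        scale = math.sqrt(1.0 / n) if p == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(p * (m + 0.5) * math.pi / n) for m in range(n)])
    return rows


def dct_lms(x, d, n, mu=0.02, beta=0.99, eps=1e-8):
    """Power-normalized LMS on the DCT of a length-n tap-delay line."""
    t = dct_matrix(n)
    w = [0.0] * n
    power = [1.0] * n              # running estimate of each transform output's power
    taps = [0.0] * n               # tap-delay line, newest sample first
    errors = []
    for xk, dk in zip(x, d):
        taps = [xk] + taps[:-1]
        u = [sum(t[p][m] * taps[m] for m in range(n)) for p in range(n)]
        e = dk - sum(wi * ui for wi, ui in zip(w, u))
        for i in range(n):
            power[i] = beta * power[i] + (1.0 - beta) * u[i] * u[i]
            w[i] += mu * e * u[i] / (power[i] + eps)   # step size scaled by 1 / power
        errors.append(e)
    return w, errors
```

Because the transform is orthonormal, the equivalent tap-domain weights are recovered by multiplying the returned weight vector by the transpose of `dct_matrix(n)`.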

The performance of algorithms based on data-independent transformations clearly depends on
the orthogonalizing capabilities of the transform used. No general proof exists that demonstrates
the superiority of one transform over the others. DFT-LMS, first introduced by Narayan [1], is
the simplest algorithm of this family, mainly because of the exponential nature of the DFT. It
is our experience though that in most practical situations DCT-LMS performs much better than
DFT-LMS [2]. In addition, it has the advantage of being real-valued.

Since the signals $x_k, x_{k-1}, \ldots$ come from a tap-delay line, their DFT/DCT at a given iteration
can easily be calculated recursively from the DFT/DCT at the previous iteration. This is sometimes
referred to as the sliding DFT/DCT, and requires only O(N) operations per iteration, where N is
the length of the LMS filter. However, in this approach, the propagation and accumulation of errors
due for example to round-off noise in floating point arithmetic makes it necessary to often reset the
DFT/DCT. This increases the overall number of computations and adds to the complexity of the
circuitry. The LMS spectrum analyzer [3] provides another way of computing a DFT recursively
in O(N) operations, but because it relies on an adaptive technique, it automatically adjusts for
possible errors.

In  the  next  sections,  we  will  recall  the  principle  of  the  LMS  spectrum  analyzer  and  demonstrate 
its  robustness  to  noise  propagation.  We  will  then  generalize  it  to  the  case  of  the  DCT.  We  will 
conclude  by  presenting  a  fully  adaptive  two-layer  linear  structure:  the  first  layer  preprocesses  the 
inputs  to  the  second  layer,  which  effects  the  fast  filtering  operation. 

LMS  spectrum  analyzer  vs.  sliding  DFT 


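The DFT of the signals $x_k, x_{k-1}, \ldots, x_{k-N+1}$ is given by

$$\mathrm{DFT}_k = \sqrt{\tfrac{1}{N}}
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & a^{-1} & a^{-2} & \cdots & a^{-(N-1)} \\
1 & a^{-2} & a^{-4} & \cdots & a^{-2(N-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & a^{-(N-1)} & a^{-2(N-1)} & \cdots & a^{-(N-1)(N-1)}
\end{bmatrix}
\begin{bmatrix} x_{k-N+1} \\ x_{k-N+2} \\ \vdots \\ x_k \end{bmatrix} \qquad (1)$$

where $a = e^{j 2\pi / N}$ and $j = \sqrt{-1}$.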

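Let us define the complex phasor $F_k = \sqrt{1/N}\,[1 \;\; a^{k} \;\; a^{2k} \; \cdots \; a^{(N-1)k}]^T$, where T denotes the
transpose. The series of phasors $F_0, F_1, F_2, \ldots$ is periodic of period N (i.e. $F_N = F_0$, $F_{N+1} = F_1$,
etc.) and $\{F_0, F_1, \ldots, F_{N-1}\}$ form an orthonormal basis in the N-dimensional space:

$$F_k^T \bar{F}_l = \begin{cases} 1 & \text{if } k = l \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\bar{F}_l$ is the complex conjugate of $F_l$. Eq. 1 can be rewritten as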


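$$\mathrm{DFT}_k = \sqrt{\tfrac{1}{N}} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} x_{k-N+1}
+ \sqrt{\tfrac{1}{N}} \begin{bmatrix} 1 \\ a^{-1} \\ \vdots \\ a^{-(N-1)} \end{bmatrix} x_{k-N+2}
+ \cdots
+ \sqrt{\tfrac{1}{N}} \begin{bmatrix} 1 \\ a^{-(N-1)} \\ \vdots \\ a^{-(N-1)(N-1)} \end{bmatrix} x_k \qquad (3)$$

$$= \sum_{m=k-N+1}^{k} x_m \bar{F}_{m-k+N-1} = \sum_{m=k-N+1}^{k} x_m \bar{F}_{m-k-1} \qquad (4)$$

$$= P^k \sum_{m=k-N+1}^{k} x_m \bar{F}_{m-1} \qquad (5)$$

where we have defined the diagonal matrix $P = \mathrm{diag}\{1, a, a^2, \ldots, a^{N-1}\}$.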

Sliding DFT. - The DFT at time k + 1 is easily related to the DFT at time k:

$$\mathrm{DFT}_{k+1} = P^{k+1} \sum_{m=k+1-N+1}^{k+1} x_m \bar{F}_{m-1} \qquad (6)$$

$$= P^{k+1} \Big[ \sum_{m=k-N+1}^{k} x_m \bar{F}_{m-1} + (x_{k+1} - x_{k-N+1})\, \bar{F}_k \Big] \qquad (7)$$

$$= P \,\big[ \mathrm{DFT}_k + (x_{k+1} - x_{k-N+1})\, P^k \bar{F}_k \big] \qquad (8)$$

$$= P \,\big[ \mathrm{DFT}_k + (x_{k+1} - x_{k-N+1})\, F_0 \big]. \qquad (9)$$

Each element of the DFT at time k + 1 is obtained by adding to the same element at time k
the contribution of the newest data sample, $x_{k+1}$, removing the contribution of the oldest one,
$x_{k-N+1}$, and multiplying by a phase factor. The update from time k to k + 1 of the whole DFT (N
components) thus requires O(N) operations. This is to be contrasted with the conventional DFT,
which is O(N^2), or its butterfly counterpart, the FFT, which is O(N log N).
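
The sliding update of Eq. 9 can be checked against a direct evaluation of Eq. 1. The sketch below uses only the standard library; the function names `direct_dft` and `sliding_dft_step` are hypothetical.

```python
import cmath
import math


def direct_dft(window):
    """Eq. 1 applied to window = [x_{k-N+1}, ..., x_k], with a = exp(j 2 pi / N)."""
    n = len(window)
    return [sum(x * cmath.exp(-2j * math.pi * p * i / n) for i, x in enumerate(window))
            / math.sqrt(n) for p in range(n)]


def sliding_dft_step(dft_k, x_new, x_old):
    """Eq. 9: DFT_{k+1} = P [DFT_k + (x_{k+1} - x_{k-N+1}) F_0]."""
    n = len(dft_k)
    delta = (x_new - x_old) / math.sqrt(n)     # every component of F_0 equals 1/sqrt(N)
    return [cmath.exp(2j * math.pi * p / n) * (dft_k[p] + delta) for p in range(n)]
```

Each step costs one addition and one complex multiplication per component, which is the O(N) count stated in the text.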

LMS spectrum analyzer. - The LMS spectrum analyzer [3] is represented in Figure 2. The
signal to be transformed, $d_k^f$ (we will see shortly how $d_k^f$ relates to $x_k$), is used as the desired output
of an adaptive filter. The input to the filter at time k is the complex phasor $F_k$.


Figure 2: The LMS spectrum analyzer for calculating the DFT.

The adaptive weights of the spectrum analyzer are updated with the complex LMS algorithm [4]:

$$W_{k+1}^f = W_k^f + 2\mu\, e_k^f\, \bar{F}_k, \qquad (10)$$

where $W_k^f$ and $e_k^f$ are respectively the weight vector and the error signal at time k, and $\mu$ is the
learning rate. The error $e_k^f$ is defined as the difference between the desired output $d_k^f$ and the
actual output $y_k^f$: $e_k^f = d_k^f - y_k^f = d_k^f - F_k^T W_k^f$. Replacing $e_k^f$ in the weight update formula,
choosing the initial weight vector $W_0^f$ to be equal to zero and the learning rate $\mu$ to be equal to
1/2, and using the periodicity and orthonormality properties of the complex phasors $F_k$, it can be
shown [3] that

$$W_k^f = \sum_{m=k-N}^{k-1} d_m^f\, \bar{F}_m = \sum_{m=k-N+1}^{k} d_{m-1}^f\, \bar{F}_{m-1}. \qquad (11)$$

Comparing with the DFT formula (Eq. 5), it is clear that if we choose $d_{m-1}^f = x_m$, the DFT and the
LMS analyzer weights are related by the simple formula $\mathrm{DFT}_k = P^k W_k^f$. At each instant k, the
weight vector of the LMS filter is proportional to the DFT of the past N data samples. Note that
the elements of the multiplicative diagonal matrix $P^k$ are precisely equal to the inputs of the filter.
The DFT components are thus simply obtained by pulling output lines from the adaptive weights
(see Fig. 2).

If we compare the expressions for the weight vector (Eq. 11) at times k and k + 1 as we did for
the DFT, we find that

$$W_{k+1}^f = W_k^f + (x_{k+1}\, \bar{F}_k - x_{k-N+1}\, \bar{F}_k), \qquad (12)$$

which is identical to Eq. 9 since the DFT and the weight vector at time k differ only by the
multiplicative factor $P^k$. Of course, this algorithm is also O(N).
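
The relation between the analyzer weights and the DFT can be verified numerically. The sketch below runs the complex LMS update with a learning rate of 1/2 and zero initial weights, then compares the phase-corrected weights with a direct DFT of the last N samples. The function names are hypothetical, and the list `desired` plays the role of the desired signal (sample i of the list is fed at time i).

```python
import cmath
import math


def phasor(k, n):
    """F_k = sqrt(1/N) [1, a^k, ..., a^((N-1)k)] with a = exp(j 2 pi / N)."""
    return [cmath.exp(2j * math.pi * p * k / n) / math.sqrt(n) for p in range(n)]


def direct_dft(window):
    """Normalized DFT of window = [oldest, ..., newest]."""
    n = len(window)
    return [sum(x * cmath.exp(-2j * math.pi * p * i / n) for i, x in enumerate(window))
            / math.sqrt(n) for p in range(n)]


def lms_spectrum_analyzer(desired, n, mu=0.5):
    """Complex LMS with input F_k; returns the weight vector after each update."""
    w = [0j] * n
    history = []
    for k, d in enumerate(desired):
        f = phasor(k, n)
        e = d - sum(fi * wi for fi, wi in zip(f, w))          # e_k = d_k - F_k^T W_k
        w = [wi + 2 * mu * e * fi.conjugate() for wi, fi in zip(w, f)]
        history.append(list(w))
    return history
```

After k updates (k at least N), multiplying weight p by a^(pk) should reproduce component p of the DFT of the most recent N samples.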

Although the sliding DFT and the spectrum analyzer look very similar, they differ in how they
handle round-off errors. In both algorithms the DFT is computed recursively; noise appearing in
the DFT at time k thus propagates to the DFT at time k + 1, k + 2, ... Because in the sliding DFT
(Eq. 9) the elements of the multiplicative diagonal matrix P all have modulus one, those errors will
propagate unattenuated, and will accumulate over time until the calculated DFT is too different
from the true DFT, and a general reset of the DFT is required. This is not the case in the LMS
spectrum analyzer.

Propagation of errors in the LMS spectrum analyzer. - Let us consider the situation
where the spectrum analyzer weight vector is free of any error up to time k - 1. At time k, we
deliberately introduce noise in $W_k^f$, and we see how this noise vector, $\epsilon_k$, propagates over time. Let
$W_k' = W_k^f - \epsilon_k$ be the perturbed weight vector. The LMS error signal defined as the difference
between the desired and the actual outputs is given by

$$e_k' = d_k - F_k^T W_k' = d_k - F_k^T W_k^f + F_k^T \epsilon_k. \qquad (13)$$

Assuming that the learning rate $\mu$ is equal to 1/2, the weight vector at time k + 1 is given by

$$W_{k+1}' = W_k' + e_k'\, \bar{F}_k = W_{k+1}^f - (I - \bar{F}_k F_k^T)\, \epsilon_k, \qquad (14)$$

where I is the N x N identity matrix. Similarly, at time k + 2, the weight vector is given by

$$W_{k+2}' = W_{k+2}^f - (I - \bar{F}_{k+1} F_{k+1}^T)(I - \bar{F}_k F_k^T)\, \epsilon_k, \qquad (15)$$

and in general, for any time k + j, we have*

$$W_{k+j}' = W_{k+j}^f - \prod_{m=0}^{j-1} (I - \bar{F}_{k+m} F_{k+m}^T)\, \epsilon_k. \qquad (16)$$

*It can be verified that the order in which the matrix multiplies are effected is irrelevant. This justifies the
otherwise ambiguous notation $\prod$.

(17)


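Without loss of generality, we can assume that k = N so that

$$W_{N+j}' = W_{N+j}^f - \prod_{m=0}^{j-1} (I - \bar{F}_{N+m} F_{N+m}^T)\, \epsilon_N
= W_{N+j}^f - \prod_{m=0}^{j-1} (I - \bar{F}_m F_m^T)\, \epsilon_N. \qquad (18)$$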

Let us now examine how the multiplication by the matrix $(I - \bar{F}_m F_m^T)$ affects the error vector
$\epsilon_N$. The vectors $\{F_0, F_1, \ldots, F_{N-1}\}$, and therefore also the vectors $\{\bar{F}_0, \bar{F}_1, \ldots, \bar{F}_{N-1}\}$, form an
orthonormal basis in the N-dimensional space. Any error vector $\epsilon_N$ can be decomposed into its N
components in this last basis:

$$\epsilon_N = \sum_{n=0}^{N-1} \epsilon_N(n)\, \bar{F}_n. \qquad (19)$$

The product $(I - \bar{F}_m F_m^T)\, \epsilon_N$ can be evaluated as:

$$(I - \bar{F}_m F_m^T)\, \epsilon_N = (I - \bar{F}_m F_m^T) \sum_{n=0}^{N-1} \epsilon_N(n)\, \bar{F}_n = \epsilon_N - \epsilon_N(m)\, \bar{F}_m. \qquad (20)$$

The multiplication of the error by $(I - \bar{F}_m F_m^T)$ eliminates its $m$-th component, and leaves the other
components unchanged. Multiplying the residual error vector by $(I - \bar{F}_{m+1} F_{m+1}^T)$ will cancel out
its $(m+1)$-th component, and so on. As iterations go by, the modulus of the error vector decreases
monotonically. After N iterations, all its components have been cancelled, and the error is reduced
to zero.
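
The cancellation argument can be illustrated by applying the N successive projections to an arbitrary error vector and watching its norm. This is a small numerical sketch; the helper names `phasor`, `apply_projection`, and `norm` are hypothetical.

```python
import cmath
import math


def phasor(k, n):
    """F_k = sqrt(1/N) [1, a^k, ..., a^((N-1)k)] with a = exp(j 2 pi / N)."""
    return [cmath.exp(2j * math.pi * p * k / n) / math.sqrt(n) for p in range(n)]


def apply_projection(err, k, n):
    """Return (I - conj(F_k) F_k^T) err, which removes the component along conj(F_k)."""
    f = phasor(k, n)
    coeff = sum(fi * ei for fi, ei in zip(f, err))            # F_k^T err
    return [ei - fi.conjugate() * coeff for ei, fi in zip(err, f)]


def norm(vec):
    return math.sqrt(sum(abs(v) ** 2 for v in vec))
```

Starting from any time index, N consecutive projections cover every phasor once, so the residual error drops to round-off level.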

In  software  and  hardware  implementations,  errors  occur  at  each  iteration.  While  the  sliding 
DFT  lets  these  errors  accumulate  over  time  and  requires  periodic  resets  of  the  DFT,  the  LMS 
spectrum  analyzer  can  be  run  without  interruption,  as  long  as  required  by  the  application. 

LMS  spectrum  analyzer  for  the  DCT 


As pointed out previously, it is our experience that in many applications the DCT-LMS
algorithm achieves better results than the DFT-LMS. There exist many different discrete cosine
transforms [5]; the one of interest here is defined as

$$\mathrm{DCT}_{N-1}(p) = \sqrt{\tfrac{2}{N}}\; k_p \sum_{m=0}^{N-1} x_m \cos\left[\frac{p\,(m + 1/2)\,\pi}{N}\right]$$

where $\mathrm{DCT}_{N-1}(p)$ is the $p$-th component of the DCT at time N - 1, and the constant $k_p = 1/\sqrt{2}$
for p = 0 and 1 otherwise. Because this DCT has a period 2N instead of N like the DFT, special
care must be taken in deriving an LMS cosine-spectrum analyzer. The $p$-th component of the DCT
at arbitrary time k can be written as


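$$\mathrm{DCT}_k(p) = \sqrt{\tfrac{2}{N}}\; k_p \sum_{m=k-N+1}^{k} x_m \cos\left[\frac{p\,(m - k + N - 1 + 1/2)\,\pi}{N}\right]$$

$$= (-1)^p \sqrt{\tfrac{2}{N}}\; k_p \left[ \cos\frac{pk\pi}{N} \sum_{m=k-N+1}^{k} x_m \cos\frac{p\,(m - 1 + 1/2)\,\pi}{N}
+ \sin\frac{pk\pi}{N} \sum_{m=k-N+1}^{k} x_m \sin\frac{p\,(m - 1 + 1/2)\,\pi}{N} \right]$$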

In order to obtain a fully adaptive structure for DCT-LMS, the two sums in Eq. 24 must be evaluated
using LMS spectrum analyzers. The LMS spectrum analyzer for the DFT was based on the fact
that the input vectors, $F_k$, had the double property of being N-periodic ($F_k = F_{k+N}$) and forming
an orthonormal basis in the N-dimensional space. It is easy to verify that the following vectors
have the same properties:


$$C_k = D_k \sqrt{\tfrac{2}{N}} \left[ k_0 \cos\frac{0\,(k+1/2)\,\pi}{N},\; k_1 \cos\frac{1\,(k+1/2)\,\pi}{N},\; \ldots,\; k_{N-1} \cos\frac{(N-1)(k+1/2)\,\pi}{N} \right]^T \qquad (24)$$

$$S_k = D_k \sqrt{\tfrac{2}{N}} \left[ k_1 \sin\frac{1\,(k+1/2)\,\pi}{N},\; k_2 \sin\frac{2\,(k+1/2)\,\pi}{N},\; \ldots,\; k_N \sin\frac{N\,(k+1/2)\,\pi}{N} \right]^T \qquad (25)$$

with

$$D_k = \begin{cases} I & \text{if } k \in [\,nN,\ (n+1)N\,[ \text{ and } n \text{ is even} \\ D & \text{if } k \in [\,nN,\ (n+1)N\,[ \text{ and } n \text{ is odd,} \end{cases} \qquad (26)$$

where I is the identity matrix and the diagonal matrix D = diag{1, -1, 1, ...}. Note that $S_k$ was
not obtained by only replacing "cos" by "sin" in $C_k$; the indices were also "shifted" (i.e. the first
component starts with index 1 instead of 0). This was necessary to ensure the orthogonality of the
$S_k$ vectors. The constants $k_1, \ldots, k_{N-1}$ are equal to 1 as before, and $k_N = 1/\sqrt{2}$.
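
The properties claimed for these vectors can be checked numerically. The sketch below builds the cosine and sine vectors as read from the definitions above (including the block-dependent sign matrix), then tests orthonormality of both families over one block and the N-periodicity of the cosine vectors. All function names are hypothetical, and the exact placement of the sign matrix is an interpretation of a partly illegible source.

```python
import math


def d_entry(k, n, i):
    """Diagonal entry i of D_k: identity on even blocks, diag{1, -1, 1, ...} on odd blocks."""
    odd_block = (k // n) % 2 == 1
    return -1.0 if odd_block and i % 2 == 1 else 1.0


def c_vec(k, n):
    """Cosine vector C_k with components p = 0 .. N-1 and k_0 = 1/sqrt(2)."""
    out = []
    for i in range(n):
        kp = 1.0 / math.sqrt(2.0) if i == 0 else 1.0
        out.append(d_entry(k, n, i) * math.sqrt(2.0 / n) * kp
                   * math.cos(i * (k + 0.5) * math.pi / n))
    return out


def s_vec(k, n):
    """Sine vector S_k with components p = 1 .. N and k_N = 1/sqrt(2)."""
    out = []
    for i in range(n):
        p = i + 1
        kp = 1.0 / math.sqrt(2.0) if p == n else 1.0
        out.append(d_entry(k, n, i) * math.sqrt(2.0 / n) * kp
                   * math.sin(p * (k + 0.5) * math.pi / n))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```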

By analogy with the DFT case, two spectrum analyzers taking for desired output $x_k$, and for
input $C_k$ or $S_k$, have the following weight vectors:

$$W_k^c = \sum_{m=k-N+1}^{k} x_m\, C_{m-1}, \qquad W_k^s = \sum_{m=k-N+1}^{k} x_m\, S_{m-1}. \qquad (27)$$


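Let us also define two other LMS spectrum analyzers. Their desired signal is

$$x_k' = \begin{cases} x_k & \text{if } k \in [\,nN,\ (n+1)N\,[ \text{ and } n \text{ is even} \\ -x_k & \text{if } k \in [\,nN,\ (n+1)N\,[ \text{ and } n \text{ is odd,} \end{cases} \qquad (28)$$

and the input vectors are $C_k$ and $S_k$ respectively. The weight vectors are given by

$$W_k'^{\,c} = \sum_{m=k-N+1}^{k} x_m'\, C_{m-1}, \qquad W_k'^{\,s} = \sum_{m=k-N+1}^{k} x_m'\, S_{m-1}. \qquad (29)$$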


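Comparing Eq. 24 with Eqs. 27 and 29, and using the definitions for $C_k$ and $S_k$ (Eq. 24 and 25), we
get the desired result:

$$\mathrm{DCT}_k(p) = \begin{cases}
\cos\dfrac{pk\pi}{N}\, W_k^c(p) + \sin\dfrac{pk\pi}{N}\, W_k^s(p) & \text{if } p \text{ is even} \\[2ex]
\cos\dfrac{pk\pi}{N}\, W_k'^{\,c}(p) + \sin\dfrac{pk\pi}{N}\, W_k'^{\,s}(p) & \text{if } p \text{ is odd}
\end{cases} \qquad (30)$$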


Instead  of  using  one  complex  LMS  filter  as  in  the  DFT,  we  used  four  real  LMS  filters  of  which  only 
half  of  the  weights  (the  even  or  the  odd  ones)  were  retained.  The  four  filters  can  of  course  be  run 
in  parallel.  As  in  the  case  of  the  DFT,  noise  rejection  is  ensured  by  the  adaptive  nature  of  the  LMS 
spectrum  analyzer. 

A Two-layer Linear Adaptive Structure


Let  us  now  incorporate  the  LMS  spectrum  analyzer  into  the  DFT-LMS  algorithm.  The  resulting 
structure  is  shown  in  Fig.3.  It  has  two  cascaded  layers  of  adaptive  weights,  but  nonetheless  it 
remains  a  linear  filter.  The  two-layer  DCT-LMS  structure  is  slightly  more  complicated  than  the 
DFT-LMS  one  but  it  is  based  on  the  same  principle. 


Figure 3: A two-layer linear adaptive filter.

The  two-layer  adaptive  filter  shown  above  and  its  DCT  counterpart  are  simple  linear  structures 
containing  only  two  or  three  LMS  blocks.  They  achieve  faster  convergence  of  the  filtering  weights 
than  pure  LMS.  They  use  a  minimum  of  computations,  about  three  times  the  amount  for  LMS 
alone,  and  offer  excellent  robustness  properties.  Recursive  least  squares  algorithms  offer  even  better 
convergence  properties,  but  only  in  time-invariant  systems.  They  are  far  more  complicated  and  can 
be  unstable.  All  in  all,  the  two-layer  DFT-LMS  and  DCT-LMS  algorithms  should  find  increased 
use  in  practical  real-time  applications. 
Abstract. The role of the Self-Organizing Map (SOM) algorithm as a genuine "distributed"
neural-network model has recently been strengthened, as it has become apparent that this algorithm
has a neurophysiologically justifiable interpretation. First, a laterally interconnected planar network
of  neural  cells  can  act  as  a  very  effective,  self-resetting  winner-take-all  circuit  if  one  describes  its 
cells  using  a  simple  nonlinear  dynamic  model.  Second,  the  Hebbian  law  of  synaptic  plasticity 
can  be  modified  such  that  if  the  nearby  synapses  are  made  to  interact  in  a  particular  way,  the 
synaptic  vectors  become  normalized.  Third,  the  lateral  interaction  between  neighboring  cells  in  the 
network  in  learning  may  be  implemented  partly  neurally,  partly  through  diffuse  chemical  agents. 
The  adaptive,  self-organizing  process  taking  place  in  such  a  physiological  model  can  then  be  shown 
to  be  almost  identical  with  that  defined  by  the  simpler  SOM  algorithms. 


1  Introduction 

The Self-Organizing Map (SOM) algorithm [1,2] may take on many forms depending on the
particular vector-space metrics used for the decoding of signal patterns by the neural cells. In simple
biologically inspired mathematical neuron models, a cell is usually activated by the incoming
signals in proportion to their weighted sum, whereby the weights are thought to represent synaptic
efficacies. A version of the SOM that makes use of such cells can be defined in the following way:

Fig. 1. Layout of a Self-Organizing Map

Assume a planar network of cells (Fig. 1), all of which receive the same input signals represented
by the vector $x = (\xi_1, \xi_2, \ldots, \xi_n)^T \in \mathbb{R}^n$. Let each cell $i$ have its own input weight vector
$m_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{in})^T \in \mathbb{R}^n$. By means of lateral feedback connections (cf. Ch. 3) the cell with index
$i = c$ becomes the "winner" and is switched into the high activity state if

$$m_c^T x = \max_i \{ m_i^T x \} , \qquad (1)$$
whereas, due to the same lateral interactions, the activity of all the other cells is suppressed to a
low  value.  This  switching  state  lasts  for  a  short  period,  after  which  it  is  automatically  reset  (cf. 
Ch.  3). 

"Learning”  in  the  above  network  means  that  the  cells  around  the  "winner”  c  are  adapted  to  the 
input $x$ at a rate $h_{ci}$, where $h_{ci}$ is a function of the lateral distance of cells $c$ and $i$ in the network,
and possibly of time, too. One learning law that is compatible with (1) may be expressed (in
discrete-time steps $t$) as

$$m_i(t+1) = \frac{m_i(t) + h_{ci}(t)\,x(t)}{\| m_i(t) + h_{ci}(t)\,x(t) \|} , \qquad (2)$$

and if the adaptation steps are small, the first terms of the Taylor series with $\|m_i(t)\| = 1$ yield

$$m_i(t+1) \approx m_i(t) + h_{ci}(t) \cdot [\,x(t) - m_i(t)\,m_i^T(t)\,x(t)\,] . \qquad (3)$$

It will be pointed out in Ch. 4 that with arbitrary $m_i(0)$, (3) tends to normalize the $m_i$.
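The winner selection (1) and the two learning laws (2) and (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the one-dimensional chain of cells in `som_step`, and the neighborhood function passed in as `h_of_distance` are assumptions made for the example and do not come from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def winner(weights, x):
    """Index c of the cell with the largest dot product m_i^T x, as in (1)."""
    return max(range(len(weights)), key=lambda i: dot(weights[i], x))

def update_exact(m, x, h):
    """Law (2): add h*x to m, then renormalize the result to unit length."""
    v = [mi + h * xi for mi, xi in zip(m, x)]
    n = norm(v)
    return [vi / n for vi in v]

def update_approx(m, x, h):
    """Law (3): first-order approximation of (2), valid for small h and ||m|| = 1."""
    mx = dot(m, x)
    return [mi + h * (xi - mi * mx) for mi, xi in zip(m, x)]

def som_step(weights, x, h_of_distance):
    """One discrete-time step on a 1-D chain of cells (the chain layout is an assumption)."""
    c = winner(weights, x)
    new = [update_approx(m, x, h_of_distance(abs(i - c)))
           for i, m in enumerate(weights)]
    return c, new
```

For a unit-length weight vector and a small rate, `update_exact` and `update_approx` differ only by terms of second order in the rate, which is the content of the Taylor-series argument leading from (2) to (3).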

A physiological SOM model that behaves like the algorithm defined by (1) and (2), or (1) and
(3), must include and implement the following functions: (i) a winner-take-all (WTA) function that
selects the "winner" and switches its activity on, (ii) a reset function that, after a small delay,
automatically suppresses the "winner", (iii) adaptation of the synaptic weights resembling the law
(3), and (iv) interaction of neighboring cells in the network during learning that resembles the effect
of $h_{ci}$ in (3).


2  A  simple  nonlinear  dynamic  model  for  the  neurons 

The output activity $\eta_i$ (spiking frequency) of neuron $i$ may be described by an effective simplified
differential equation

$$d\eta_i/dt = I_i - \gamma(\eta_i) \qquad (4)$$

where $I_i$ is the combined effect of all inputs, e.g., afferent inputs as well as feedbacks, on cell $i$,
possibly embedded in a network of cells [3]. In simple modeling, without much loss of generality,
$I_i$ may be thought to be proportional to the dot product of the signal vector and the synaptic
efficacy vector. Let $\gamma(\eta_i)$ describe the resultant of all loss or leakage effects that oppose $I_i$. This
is an abbreviated way of writing: since $\eta_i \ge 0$, (4) only holds when $\eta_i > 0$, or when $\eta_i = 0$ and
$I_i - \gamma(\eta_i) \ge 0$, whereas otherwise $d\eta_i/dt = 0$. We have also found that for stable convergence in a
system of interconnected cells, $\gamma(\eta_i)$ must be convex, i.e., $\gamma''(\eta_i) > 0$.

It should be noticed that (4) may also define a "sigmoid"-type transfer function: in the stationary
state, with $I_i$ constant in time and $d\eta_i/dt = 0$, we have

$$\eta_i = \gamma^{-1}(I_i) \qquad (5)$$

in the domain where $\eta_i$ is defined; $\gamma^{-1}$ may, for instance, saturate at high input. However, (4) is a
more general model law, because it can be used to describe dynamic phenomena as well.
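The relation between the dynamic law (4) and the stationary transfer function (5) can be checked numerically. The sketch below integrates (4) with the forward-Euler method and clamps the activity at zero to mimic the hard-limiting restriction. The loss function `gamma(eta) = eta**2` is an assumption chosen only because it is convex and easy to invert; it is not the loss function used in the paper, and the step size and step count are likewise illustrative.

```python
import math

def gamma(eta):
    """Assumed convex loss term (illustrative choice, not the paper's)."""
    return eta * eta

def simulate_neuron(I, eta0=0.0, dt=1e-3, steps=20000):
    """Forward-Euler integration of d(eta)/dt = I - gamma(eta), with eta kept >= 0."""
    eta = eta0
    for _ in range(steps):
        # max(0, ...) enforces that the activity cannot become negative
        eta = max(0.0, eta + dt * (I - gamma(eta)))
    return eta
```

With this assumed loss, the stationary activity for a constant positive input `I` is `math.sqrt(I)`, the inverse of `gamma`, while a negative input leaves the activity at zero.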
3  The WTA function


Consider now Fig. 2, which delineates the cross section of a two-dimensional network [4]. The larger
circles represent "principal" neurons, such as the pyramidal cells in the cortex, and they receive
external inputs to which they have to yield a selective response. The smaller circles, the "reset
neurons", are inhibitory neurons that have a longer time constant, and in the simplest model each
of them feeds back to the same principal neuron to which it is assigned. Their purpose is to suppress
the "winner" after a certain delay.


Fig. 2. Simplified model of a distributed neural network (cross section of a two-dimensional
array). Each location consists of an excitatory principal input neuron and an inhibitory
interneuron that feeds back locally. The lateral connections between the principal input
neurons may or may not be made via interneurons.

In a more complete model [^'] the principal neurons may be interconnected through a great many
excitatory and inhibitory interneurons, whereas each solid arrow in Fig. 2 only approximates this
"polysynaptic" interconnection between cell $i$ and cell $k$ by an effective static coupling strength $g_{ik}$.
For $k \ne i$, $g_{ik} < 0$, while $g_{ii} > 0$. This approximation is justified as long as the time constants of the
interneurons, let alone the "reset neurons," are small or at least of the same order of magnitude as
those of the principal neurons [6].

Referring to the more complete discussion in [4] we write the system equations as

$$d\eta_i/dt = I_i - a\zeta_i - \gamma(\eta_i)$$

$$d\zeta_i/dt = b\eta_i - \gamma_1(\zeta_i) \qquad (6)$$

where $a$ and $b$ are constants, and similar hard-limiting restrictions as in (4) must apply to the
right-hand sides. Moreover we assume that $I_i$ is decomposed as $I_i = I_i^e + I_i^f$, where

$$I_i^e = m_i^T x = \sum_j \mu_{ij}\xi_j \quad \text{is the "external" input, and}$$

$$I_i^f = \sum_k g_{ik}\eta_k \quad \text{represents the lateral feedback, respectively.} \qquad (7)$$

In Fig. 3 we approximate the loss function $\gamma_1(\zeta_i)$ by another constant $\theta$. This circuit will be
seen to operate in cycles, where each cycle can be thought to correspond to one discrete-time phase
in (2) or (3). During each cycle the cell corresponding to the "winner" (maximum $m_i^T x$) will take
over and suppress the other cells. Normally the input would be changed during each new cycle;
however, if the input is held steady for a longer time, the next cycle activates the "runner-up",
after which the "winner" is activated again, etc.
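The competition stage of this circuit can be illustrated with a small simulation of (6) and (7). The sketch below is hedged in several ways: it uses forward-Euler integration rather than the Runge-Kutta method mentioned with Fig. 3, it replaces the loss function of the principal cells by the assumed convex form `eta**2`, and it follows the text in replacing the loss function of the reset cells by a constant `theta`. The coupling and feedback values are those quoted with Fig. 3; the function name, the four-cell input and the integration horizon are hypothetical. Only the selection of the first winner is shown; whether and how the reset cycle unfolds depends on the loss function and is not reproduced here.

```python
def simulate_wta(I_ext, g_self=0.5, g_lat=-2.0, a=1.0, b=1.0, theta=0.5,
                 dt=1e-3, t_end=1.5):
    """Euler integration of the principal activities eta and reset activities zeta."""
    n = len(I_ext)
    eta = [0.0] * n
    zeta = [0.0] * n
    for _ in range(int(round(t_end / dt))):
        total = sum(eta)
        new_eta, new_zeta = [], []
        for i in range(n):
            # lateral feedback: self-excitation plus inhibition from all other cells
            feedback = g_self * eta[i] + g_lat * (total - eta[i])
            d_eta = I_ext[i] + feedback - a * zeta[i] - eta[i] ** 2  # assumed loss
            d_zeta = b * eta[i] - theta
            # hard limits: neither activity may become negative
            new_eta.append(max(0.0, eta[i] + dt * d_eta))
            new_zeta.append(max(0.0, zeta[i] + dt * d_zeta))
        eta, zeta = new_eta, new_zeta
    return eta, zeta
```

Calling `simulate_wta([0.1, 0.9, 0.3, 0.2])` leaves only the cell with the largest external input active, while its reset activity `zeta` has started to build up.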


Fig. 3. Demonstration of the WTA function provided with automatic reset. The first inputs were
applied at time zero. New inputs were applied as indicated by the dotted arrow. The network consisted of
20 cells, and the inputs $I_i^e = m_i^T x$ were selected as random numbers from the interval (0, 1). The $g_{ii}$ were
equal to 0.5 and the $g_{ik}$, $i \ne k$, were $-2.0$, respectively. The loss function had the form y{ri) = O.llnl^;
other simpler laws can also be used. The feedback parameters were $a = b = 1$, $\theta = 0.5$. The network operates
as follows: The first "winner" is the cell that receives the largest input; its response will first stabilize to a
high value, while the other outputs tend to zero. When the activity of the "winner" is temporarily depressed
by the dynamic feedback, the other cells continue competing. The solution was obtained by the classical
Runge-Kutta  numerical  integration  method,  using  a  step  size  of  10~'. 

It may be necessary to emphasize that the single-winner WTA circuit described above was obtained
when the excitatory feedback connection ($g_{ii}$ above) is only made to the same principal cell.
In biological networks it is more plausible that excitatory feedbacks extend to a greater group of
neighboring neurons, making the activity in, say, several hundreds of nearby cells correlate strongly.
Such "bubbles" of activity have been simulated [1,5], whereas a full mathematical analysis has so
far been carried out for single-winner WTA networks only.

4  A  non-Hebbian  law  for  synaptic  modifiability 

The Hebbian adaptation law for a neuron with input signals $\xi_j$ and output activity $\eta_i$ means that
for the adaptive changes at synapses we assume $d\mu_{ij}/dt \propto \xi_j \eta_i$. Quite apparently the pure Hebbian
law is unsatisfactory since $\mu_{ij}$ would grow monotonically. In order to make reversible changes
possible, we must take into account the mutual interference between nearby synapses in the same
cell. If this interference is mediated by postsynaptic coupling by nearby synapse $r$, one of the
simplest and most natural conceivable laws for such a reversible or "active forgetting" effect would
read

$$d\mu_{ij}/dt \propto \Big(\xi_j - \lambda\,\mu_{ij} \sum_r \xi_r \mu_{ir}\Big)\,\eta_i \qquad (8)$$

where $\lambda$ is a decay constant, $\xi_r \mu_{ir}$ is the postsynaptic effect at synapse $r$, which proportionally
reduces the synaptic strength $\mu_{ij}$ at synapse $j$, and index $r$ runs over nearby synapses on the same
cell.

In order to implement a SOM process, the law (8) must still be modified to take
into account the interaction of neighboring cells during learning. Such an interaction could be
implemented by another system of short-range lateral neural connections that do not contribute to
the output activities $\eta_i$ but only modulate the modifiability of synapses of a neighboring neuron (like
the chemical transmitter norepinephrine does), or alternatively, this interaction may be controlled
by nonspecific chemical agents, acting like messengers to nearby neurons. Whatever the nature of
such a spatial modulatory interaction is, it may be described at neuron $i$ by a sum term $\sum_l h_{il}\eta_l$,
where the $h_{il}$ are interaction strengths during learning, and $l$ runs laterally over the neighborhood
of cell $i$ in the network. Finally we thus get for the form of the non-Hebbian adaptation law to be
used to describe learning effects in the SOM,

$$d\mu_{ij}/dt \propto \Big(\xi_j - \lambda\,\mu_{ij} \sum_r \xi_r \mu_{ir}\Big) \sum_l h_{il}\eta_l . \qquad (9)$$

It may already be discernible from (9), but it has further been justified in [4], that (9) and (3) are
very similar expressions: (3) in discrete time, (9) in continuous time. As a matter of fact, (6), (7),
and (9) together define a continuous-time self-organizing process that has properties very similar
to those of the SOM. Illustrative solutions for the $\eta_i(t)$ and $m_i(t)$ that confirm this have been obtained by
numerical simulation (Ch. 5), whereby formation of self-organizing maps has been observed, in a
similar fashion as by the SOM algorithm, (1) and (3).

Relating to (9), we may now consider a local subset of synapses at the cell's membrane, such
as the synapses of a large apical dendrite branch of a pyramidal cell, over which index $r$ runs,
and denote the input to this subset by $x$: $x = (\xi_1, \xi_2, \ldots, \xi_n)^T$, with $m_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{in})^T$ the
corresponding synaptic weight vector. Further it seems reasonable to resort to a kind of "mean
field" approximation, whereby, from the point of view of cell $i$, the factor $\sum_l h_{il}\eta_l$, which we denote
by $\alpha$, shall not depend on $\eta_i$, or depends on it only weakly. Then (9) can be written in vector form,

$$dm_i/dt = \alpha x - \beta\, m_i m_i^T x , \qquad (10)$$

with $\beta = \lambda\alpha$. This is a matrix Riccati equation, and the author has solved it in his book of 1984 [1].
With time, every $m_i$ becomes normalized to the same length: it tends to $\bar{x}\sqrt{\alpha}/(\|\bar{x}\|\sqrt{\beta})$, whose norm is
$\sqrt{\alpha/\beta}$, where $\bar{x}$ is the mean of its input. This is exactly what is needed to make (9) compare with (2) and (3).
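The normalizing effect of (10) can be observed numerically. The sketch below integrates (10) with the forward-Euler method for a constant input vector, which is a simplifying assumption (in the text the input is a random signal and the result holds for its mean). The function name, the step size and the values of `alpha` and `beta` are chosen only for illustration.

```python
import math

def integrate_weight(x, m0, alpha=1.0, beta=4.0, dt=1e-3, steps=5000):
    """Forward-Euler integration of dm/dt = alpha*x - beta*m*(m^T x) for constant x."""
    m = list(m0)
    for _ in range(steps):
        mx = sum(mi * xi for mi, xi in zip(m, x))  # scalar m^T x
        m = [mi + dt * (alpha * xi - beta * mi * mx) for mi, xi in zip(m, x)]
    return m
```

Starting from `m0 = [0.1, 0.0]` with `x = [3.0, 4.0]`, the weight vector turns into the direction of `x` and its length settles at `math.sqrt(alpha / beta)`, here 0.5, independently of the length of `x`.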

It  should  further  be  emphasized  that  the  cyclic  operation  of  the  WTA  function,  as  described  in 
Ch.  3,  in  effect  samples  the  input  signals.  The  physiological  model,  although  originally  written 
in  continuous  time,  then  acts  as  if  the  signals  were  expressed  as  a  discrete-time  series!  Thus  the 
analogy  of  the  physiological  model  and  the  algorithmic  SOM  is  even  closer. 


5  Animation 

Fig.  4  shows  three  frames  from  a  film  that  describes  the  continuous-time  self-organizing  process, 
based  on  (6),  (7)  and  (9).  The  input  vector  x  was  two-dimensional  and  had  a  uniform  distribution 
over  the  square  frame;  this  example  corresponds  to  the  standard  SOM-experiments  found,  e.g.,  in 
[1],  where  the  weight  vectors  coincide  with  the  nodes  of  this  net.  It  took  three  weeks  to  run  this 
simulation on a Silicon Graphics R3000 computer, because the stepwise solution of the differential
equations  (with  relative  step  size  10”®)  was  extremely  slow.  The  corresponding  learning  process 
based  on  the  SOM  algorithm  (1)  and  (2)  takes  only  a  few  dozen  seconds. 


Fig.  4. 


It  should  also  be  mentioned  that  the  last  frame  in  Fig.  4  does  not  yet  represent  the  final  converged 
state,  but  is  still  somewhat  "shrunk”;  we  stopped  the  program  after  three  weeks  of  running. 


6  Conclusion 

We have been able to show that the SOM algorithm may describe the behavior of a
neurophysiological system rather accurately, while the original algorithm is computationally on the order of
30000 times faster than the more naturalistic description studied in this paper.


References 


[1]  T. Kohonen, Self-Organization and Associative Memory, Heidelberg: Springer, 1984, 3rd ed. 1989.

[2]  T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, pp. 1464-1480, September 1990.

[3]  T. Kohonen, "An introduction to neural computing," Neural Networks, vol. 1, pp. 3-16, 1988.

[4]  T. Kohonen, "Physiological Interpretation of the Self-Organizing Map Algorithm," Neural
Networks, vol. 6, pp. 895-905, 1993.

[5]  T. Kohonen, "The 'Neural' Phonetic Typewriter," Computer, vol. 21, pp. 11-22, March 1988.

[6]  S. Kaski and T. Kohonen, "Winner-Take-All Networks for Physiological Models of Competitive
Learning," to appear in Neural Networks, a special issue on Models of Neurodynamics
and Behaviour.




Adaptive  Wavelet  Networks  for  Pattern  Recognition 

Brian  A.  Telfer  and  Harold  H.  Szu 
Naval  Surface  Warfare  Center,  Dahlgren  Division,  Code  B44 
Silver Spring, MD 20903

Abstract 

The utility and robustness of wavelet features is demonstrated through three practical case studies of
detecting objects in multispectral electro-optical imagery, sidescan sonar imagery and acoustic backscatter.
Emphasis is placed on choosing proper waveforms for particular applications, on advantages of using
multiple waveform types to detect local features in an object, and on adaptively computing the waveforms and
their dilation and shift parameters to optimize classification accuracy.

1  Introduction 

In  any  pattern  recognition  application,  proper  choice  of  features  is  a  critical  issue.  It  is  well  known  that 
for  practical  applications  with  finite  training  data,  increasing  the  number  of  features  up  to  a  point  reduces 
the test set misclassification error, while increasing beyond that point increases the error because too many
trainable classifier parameters cause overfitting [1]. Thus, it is important to choose a small number of
features  that  contain  the  most  discriminatory  information.  Just  as  important  for  practical  applications,  the 
features  must  be  robust  to  data  variations  not  necessarily  exhibited  in  the  training  set.  We  have  found  in 
several  real  applications  that  information  at  different  resolution  scales  provided  by  wavelet  features  leads  to 
highly  discriminating,  robust  classifiers.  Additionally,  adapting  the  wavelets  to  specific  applications  using 
neural  networks  results  in  a  small  number  of  features  that  reduces  the  effects  of  overfitting. 

Wavelets  are  attractive  features  for  their  ability  to  examine  data  at  different  scales  and  frequencies. 
Additionally,  unlike  windowed  Fourier  transforms,  the  wavelet  transform  allows  a  choice  of  basis.  Thus, 
in  using  wavelet  features,  the  proper  wavelet  waveform  and  a  small  number  of  shifts  and  dilations  should 
be  chosen  to  provide  significant  discriminatory  information.  To  address  this,  we  have  combined  wavelets 
and neural networks to adaptively compute a superposition-wavelet filter that is optimized for classification
[2].  Others  have  also  studied  wavelet  networks,  but  for  function  approximation  rather  than  as  classifiers 
of  wavelet-based  features  [3-8].  ([8]  implements  the  promising  approach  of  determining  class  boundaries 
using  wavelets,  but  this  differs  from  using  wavelet  features.)  Selecting  wavelet  features  for  classification  is 
quite different than for approximation and representation. The different considerations that are important
for  classification  are:  1)  features  must  contain  information  that  differs  between  classes,  rather  than  the 
information  in  common  to  the  data,  2)  orthogonality  is  not  as  important  as  are  waveforms  that  match 
the  application  to  extract  the  most  discriminatory  information,  3)  selecting  the  best  features  need  not  be 
real-time  since  it  is  an  off-line  process. 

After  a  brief  mention  of  the  central  ideas  of  wavelets,  three  case  studies  are  presented.  The  first  [9], 
detecting objects from multispectral electro-optical imagery, demonstrates the power of an on-center,
off-center filter to remove background clutter and camera nonuniformities, within the context of wavelets to
process  the  data  at  different  scales.  The  emphasis  in  the  second  study  [10]  with  sidescan  sonar  imagery  was 
to  demonstrate  the  utility  of  composite  wavelet  features,  that  is,  groups  of  different  wavelet  waveforms  to 
identify  different  local  features  in  an  object.  The  evidence  provided  by  each  feature-type  is  then  fused  with 
a  neural  network  to  produce  a  decision.  Both  of  these  studies  employed  user-specified  wavelets,  although 
the  obvious  utility  of  adaptivity  will  be  described.  The  third  study  [11],  based  on  detecting  objects  in  active 
sonar  returns,  demonstrates  the  ability  of  a  combination  of  wavelet  feature  detectors  and  a  neural  network 
to  adaptively  determine  wavelet  features  that  provide  the  most  discriminatory  information. 

2  Wavelets 

The wavelet transform (WT) is a powerful technique for representing data at different scales and
frequencies through constant-Q bandpass filters, e.g. [12]. Wavelets are especially attractive from the standpoint of
neural networks because the human ear computes an approximate WT [13], and the eye has been shown to
have wavelet-like receptive fields [4]. In 1-D, the continuous WT is given by

$$W_s(a,b) = \int g^*\!\left(\frac{t-b}{a}\right) s(t)\, dt / a , \qquad (1)$$

where $g(t)$ is the wavelet, $a$ and $b$ are dilation and shift parameters, and we have adopted a $1/a$ normalization
rather than the conventional $1/\sqrt{a}$ to allow for unbiased frequency interpretation [14]. The work in Section
3.3 is based on a Morlet wavelet given by


$$g(t) = \exp(-t^2/2 + j5t). \qquad (2)$$
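The effect of the $1/a$ normalization can be illustrated with a direct numerical evaluation of the continuous WT at a single dilation and shift. The sketch below approximates the integral by a Riemann sum over a sampled signal. The function names are hypothetical, and the Morlet center frequency `omega0 = 5.0` is an assumption about the constant in the wavelet definition, exposed as a parameter so that it can be changed.

```python
import cmath
import math

def morlet(t, omega0=5.0):
    """Complex Morlet-type wavelet exp(-t^2/2 + j*omega0*t); omega0 is an assumed value."""
    return cmath.exp(-0.5 * t * t + 1j * omega0 * t)

def cwt_point(signal, times, a, b, omega0=5.0):
    """Riemann-sum approximation of W(a, b) = (1/a) * integral conj(g((t-b)/a)) s(t) dt."""
    dt = times[1] - times[0]  # uniform sampling is assumed
    acc = 0j
    for s, t in zip(signal, times):
        acc += morlet((t - b) / a, omega0).conjugate() * s
    return acc * dt / a
```

With this normalization a cosine of any frequency `omega` produces the same peak magnitude, `math.sqrt(2 * math.pi) / 2`, at the dilation `a = omega0 / omega`, so magnitudes at different scales can be compared directly. A constant signal gives a response close to zero, reflecting the (nearly) zero mean of the wavelet.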

One of the powerful properties of the WT is the freedom to choose a wavelet basis, with the primary
requirement being that the wavelet have zero mean. Of course, for classification features, we do not compute
a full WT, but only compute the wavelet or wavelet-derived features needed. The completeness of the WT
is unneeded, since there is no need to reconstruct the original signal for classification. In our work, we have
sampled the continuous WT at discrete shifts and dilations. Although wavelets chosen to fit a particular
application are similar to banks of scaled matched filters, wavelets often perform better because of their
zero-mean nature, which eliminates background areas through sensitivity to edges of particular shapes [15].
This is demonstrated in Sec. 3. Also, a formulation has been developed to linearly combine the wavelet
function at different dilations to produce scale-invariant wavelet functions [16], which avoid the additional
computation required by a bank of filters.

The discrete WT has received tremendous attention (see [12] for an overview) and a fast O(N) algorithm
exists.  We  do  not  consider  that  here,  except  to  note  that  work  is  ongoing  to  allow  more  freedom  to  choose 
a  particular  wavelet  waveform  while  still  allowing  a  fast  algorithm  [17,18]. 

3  Case  Studies 

3.1  Multispectral  Imagery 

One band from a set of six-band multispectral imagery [9] is shown in Fig. 1a. The six 480 x 720-pixel
spectral bands have wavelengths evenly spaced between 400 and 900 nm. In the foreground are the blob-shaped
objects we wish to detect, and various types of clutter are visible.

For  comparison,  a  perceptron  with  two  layers  of  weights  was  trained  with  the  six  spectral  values  from 
one  pixel  forming  a  feature  vector.  The  training  set  consisted  of  21  target  pixels  (the  centers  of  half  the 
targets) and 279 clutter pixels. The classification of all pixels (training and test) is shown in Fig. 1b.
The  classification  results  are  poor,  with  some  blobs  not  detected  and  many  false  alarms,  especially  along 
the  boundaries  between  ground  types.  Blobs  on  the  right  side  of  the  image  are  missed  or  poorly  detected 
because  a  camera  nonuniformity  causes  the  right  side  of  the  image  to  appear  slightly  lighter  than  the  left. 

For the wavelet processing, the wavelet was chosen to have a positive central elliptical area matching
the blobs in Figure 1a, with a surrounding negative area so that the function integrates to zero (on-center,
off-surround). We have chosen one wavelet with only a single scale and orientation since we know the target
size and orientation for this application. A straightforward method of detecting other scales would be to use
other wavelet dilations. A preferable alternative has been developed to linearly combine the wavelet function
at different dilations to produce scale-invariant wavelet features [16]. The wavelet was correlated with each
spectral band, and another perceptron with two layers of weights was trained on the wavelet-preprocessed
spectral bands in the same manner as before. The classification results are shown in Fig. 1c. Every blob
has been detected and the only false alarms are a few small areas of a resolution chart. Even these few false
alarms could be eliminated by discarding detections that contain only 2-3 pixels (at the cost of missing a
few of the blobs). Thus, wavelet preprocessing significantly improves over classification of the raw data, and
is robust to the camera nonuniformity and diverse clutter types present in the image.
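Why a zero-mean, on-center, off-surround wavelet suppresses uniform background and slow brightness offsets can be seen from a small sketch. The kernel size, the ellipse radii and all function names below are hypothetical and chosen only for illustration; the wavelet actually used in the study is not specified at this level of detail.

```python
def ellipse_mask(size, center, rx, ry):
    """Boolean grid that is True inside an axis-aligned ellipse."""
    return [[((x - center) / rx) ** 2 + ((y - center) / ry) ** 2 <= 1.0
             for x in range(size)] for y in range(size)]

def center_surround_kernel(size=9, rx=2.0, ry=1.5):
    """+1 inside the central ellipse, a negative constant outside, summing to zero."""
    mask = ellipse_mask(size, size // 2, rx, ry)
    n_in = sum(cell for row in mask for cell in row)
    surround = -n_in / (size * size - n_in)
    return [[1.0 if cell else surround for cell in row] for row in mask]

def correlate_valid(image, kernel):
    """2-D correlation restricted to positions where the kernel fits entirely."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        out.append([sum(kernel[i][j] * image[r + i][q + j]
                        for i in range(kh) for j in range(kw))
                    for q in range(len(image[0]) - kw + 1)])
    return out

def blob_image(size=15, center=7, rx=2.0, ry=1.5, background=7.0, amplitude=3.0):
    """Constant background with one elliptical blob of extra brightness."""
    mask = ellipse_mask(size, center, rx, ry)
    return [[background + (amplitude if cell else 0.0) for cell in row] for row in mask]
```

Correlating the kernel with a constant image gives a response of zero everywhere regardless of the brightness level, while an image containing a matching blob gives its largest response where the kernel is centered on the blob.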

Although  we  did  not  test  adaptive  wavelet  techniques  for  this  case  study,  these  would  be  beneficial  in 
optimizing  the  size  of  the  wavelet’s  negative  area,  in  adapting  the  waveform  to  different  sensors,  and  in 
computing  additional  wavelets  to  reject  clutter. 
3.2  Sidescan Sonar Imagery

A sidescan sonar image is shown in Fig. 2a [10]. Objects, predominantly in the right-hand side of Fig. 2a,
appear as highlights with shadows extending in the direction of sound propagation. As in Section 3.1 and as
expected, neural network classification of the raw data produced poor results. In applying wavelets to this
problem, two wavelet filters were selected, one with an on-center, off-surround form to detect highlights, and
the other a similar but horizontally elongated filter to detect shadows. Each wavelet at different scales was
correlated with the input image. A single-layer perceptron with two weights (one corresponding to a pixel
in the highlight correlation image and one to a pixel in the shadow correlation image) was trained to fuse
evidence from the two correlation images at a particular scale. It was shown that the same neural network
would also properly fuse the correlation images at different scales. The classification output is shown in
Fig. 2b, produced with wavelets at a scale matching most of the targets. The results are excellent. (The
responses in the central swath would normally be gated out, but are included in this image.) Wavelets at
different scales, in conjunction with the neural network, were demonstrated to detect objects at different
scales (results shown in [10]). Again, the wavelet features are demonstrated to be robust to a wide range of
clutter levels and types.

3.3  Acoustic  Signals 

Fig. 3a shows representative acoustic backscatter from a metallic object and natural clutter when
ensonified with a linear-FM transmit signal [11]. The two signal classes have similar strengths and durations,
and there is significant intraclass variation, which creates a challenging pattern recognition task. Synthetic
reverberation noise (20 dB) was added to all signals to increase realism. The data were divided into training
and test sets with 222 and 216 returns, respectively (multiple aspects and multiple days of collection).
Fig. 3 shows corresponding wavelet transform magnitudes, computed with a Morlet wavelet.

To find the groupings of wavelet features, the algorithm adapts Gaussian patches in the time-scale
magnitude space. In that sense, it is similar to a radial basis classifier, but adaptively computes wavelet features
rather than finding boundaries in an existing feature space. The classifier output $v_n$ for the $n$th training
sample is given by

..  =  r  {e».  [-1  ((=^)’  ^  (5^)’)]  .  (3) 

where $w_k$, $m_{ak}$, $m_{bk}$, $\sigma_{ak}$ and $\sigma_{bk}$ are the weight, means and standard deviations of the $k$th Gaussian patch,
and $\gamma(z) = 1/[1 + \exp(-z)]$ (normally denoted by $\sigma$, but changed here to avoid notational conflict). The
classification  parameters  are  optimized  by  minimizing  the  mean  squared  error  between  the  actual  and  desired 
outputs  using  gradient  descent. 
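One way such a classifier output could be computed is sketched below. This is an assumed reading, not the authors' implementation: it takes each Gaussian patch to weight the WT magnitudes on a sampled dilation/shift grid, sums the weighted magnitudes into one feature per patch, and passes the weighted sum of features through the logistic function. All function and parameter names are hypothetical, and the gradient-descent training step is not shown.

```python
import math

def logistic(z):
    """The squashing function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def patch_feature(wt_mag, scales, shifts, m_a, m_b, s_a, s_b):
    """Sum of WT magnitudes weighted by one Gaussian patch in the (a, b) plane."""
    total = 0.0
    for i, a in enumerate(scales):
        for j, b in enumerate(shifts):
            g = math.exp(-0.5 * (((a - m_a) / s_a) ** 2 + ((b - m_b) / s_b) ** 2))
            total += wt_mag[i][j] * g
    return total

def classifier_output(wt_mag, scales, shifts, patches):
    """patches is a list of tuples (w, m_a, m_b, s_a, s_b), one per Gaussian patch."""
    z = sum(w * patch_feature(wt_mag, scales, shifts, m_a, m_b, s_a, s_b)
            for (w, m_a, m_b, s_a, s_b) in patches)
    return logistic(z)
```

Under this reading, the patch means and standard deviations decide which region of the time-scale plane contributes to a feature, and the patch weight decides how strongly that feature pushes the output toward one class.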

The resulting adaptive wavelet classifier (with 30 Gaussian patches) gives a test set error rate of 0.083 vs
0.130 from classifying power spectral features. (This is one result from numerous tests at different levels of
reverberation noise, which all have the same qualitative difference.) This difference is not surprising, since the
wavelet classifier makes use of time and frequency information, whereas the power spectral classifier only uses
frequency information. The adaptive wavelet features are robust, in that they generalize well to the test data
that has considerable intraclass variation from the training data, and also contains reverberation noise.
Fig. 4 displays the adaptive wavelet classifier weights corresponding to the time/frequency information in Fig.
3b. The Gaussian patches can be seen as well as the areas that are given heaviest weight for classification,
particularly the trailing edge of the return. Ongoing work for this application emphasizes computing a small
set of wavelet-based features that can be quickly computed, rather than integrating sets of WT magnitudes.

4  Conclusion 

The three case studies summarized here demonstrate the promise of wavelet and adaptive wavelet
classifiers for practical applications that require robust features. This work is heading toward classifiers that use
multiple adaptive wavelet waveforms to provide discriminatory information that is local in time/space and
frequency/scale. The wavelet waveforms are adaptive to particular applications, and can incorporate scale
invariance through appropriate combination of wavelet coefficients.
Figure 1: a) 400 nm band of multispectral electro-optical imagery, b) detections by neural network operating
on spectral data, c) detections by neural network operating on wavelet-preprocessed data.
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(b) 

Figure  2:  a)  sidescan  sonar  image,  b)  detections  by  neural  network  fusing  evidence  from  preprocessing  with 
highlight  and  shadow  wavelets. 
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Figure  4:  Adaptive  wavelet  classifier  weights  (frequency  vs.  time). 

Abstract

The presentation of a series of patterns to a Kohonen network (Self-Organizing Feature Map) is known
to generate a "trajectory" that may itself carry meaningful information. Handwritten signature verification
is  a  typical  application  in  which  each  pattern  represents  a  specific  portion  of  a  given  signature.  An 
efficient  way  to  compare  an  unknown  trajectory  to  a  reference  trajectory  consists  in  using  a  string 
comparison  technique  developed  by  Wagner  &  Fischer  (and  related  to  dynamic  programming).  Editing 
costs  (for  insertions/deletions/substitutions)  are  determined  according  to  the  response  of  the  feature  map 
following  presentation  of  the  sequences  to  be  compared.  An  example  in  signature  verification  is  given 
to  illustrate  the  method. 


I. Introduction

The neural network model called Self-Organizing Feature Map (SOFM - described by Kohonen in [3],
[4], [5] and [6]) has been successfully applied to pattern classification (along with LVQ) and vector
quantization problems. However, in several situations, sequences of patterns must be dealt with in
addition to the recognition of the patterns themselves. Upon presentation of such a sequence, specific
cells are activated one after the other, thus generating a trajectory over the entire map that is typical of
the input sequence. [5] has pointed out the above-described phenomenon when studying the feature map
in speech recognition; other applications may produce the same results: character recognition based on
the analysis of successive letter segments ([11], [12]), study of EEG signals during sleep [14], signature
verification [9], etc. These applications show the importance of evaluating the correctness of a test
trajectory generated over a SOFM compared to reference trajectories.

This  paper  is  mainly  aimed  at  the  signature  verification  problem,  even  though  the  proposed  technique 
could  be  applied  to  other  types  of  sequences.  Let  us  consider  a  set  of  reference  signatures  obtained  from 
an  individual;  these  signatures  are  segmented  into  various  elements  from  which  feature  vectors  are 
extracted;  the  vectors  are  then  grouped  to  form  a  training  set  for  a  SOFM;  after  learning,  each  cell  of  the 
SOFM  is  tuned  to  the  shape  and  velocity  profiles  of  a  specific  portion  of  the  individual's  signature;  when 
a test signature is to be validated, the resulting trajectory (potentially distorted) must be compared to those
of  the  references,  taking  into  account  possible  erroneous,  missing  or  additional  elements. 

The rest of the paper is divided into six sections. Section II describes the type of SOFM used. Section
III presents sequence comparison problems as well as solutions already proposed in the literature.
Section IV presents a brief summary of the Wagner & Fischer algorithm and explains the necessary
adjustments. Finally, in section V, an example using simple signatures illustrates the trajectory
comparison process.

II. Description of the SOFM

The SOFM network is seen as a tool for the visualization of metric-topological relationships in a
multidimensional vector space. It can be used in pattern classification, clustering and vector quantization
since prototype vectors are formed during learning and correspond to the centroids of sub-domains within
the overall space (these prototypes are actually the weight vectors associated with the cells). When an input
vector $I(t)$ is presented to the network, cells compete against each other and the cell having its weight
vector $W(t)$ closest to $I(t)$ outputs the strongest response, which allows the inhibition of all the others (the
cell is said to win the competition). The usual measures of proximity include the Euclidean distance and
the dot product between $I(t)$ and $W(t)$. The dot product will be used here and the justification will be
given later in section IV. The activation of a cell (namely cell i) is then given by:

$$\mathrm{Act}_i(t) = I(t) \cdot W_i(t) \qquad (1)$$

where $I(t)$ and $W_i(t)$ are normalized. Learning is carried out in a classical manner ([4], [5], [6] and [13]):


• presentation to the SOFM of a vector $I(t)$ drawn from the training set;

• determination of the winning cell c such that $\mathrm{Act}_c(t) = \max_i \mathrm{Act}_i(t)$;

• adaptation of the weight vectors associated with cells belonging to the neighborhood $N_c(t)$ centered on
cell c;

with $N_c(t) \to c$; furthermore,

$$W_i(t+1) = \begin{cases} \dfrac{W_i(t) + h(r,t)\,I(t)}{\lVert W_i(t) + h(r,t)\,I(t) \rVert} & \forall\, i \in N_c(t) \\[2ex] W_i(t) & \text{otherwise} \end{cases} \qquad (2)$$

$$h(r,t) = \alpha(t)\, e^{\,\ldots} \qquad (3)$$

where $\alpha(t) \to 0$ with t increasing, and r is the distance between cell i and cell c within a
neighborhood $N_c(t)$ of diameter $\sigma_t$. Cells are labelled with a code (e.g. a letter); upon recall, a
given sequence would produce the activation of a series of cells which may be summarized by
a string of codes (e.g. "a e f d c b").
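
The learning step described above can be sketched as follows. This is only an illustration under stated
assumptions: the exponent of the neighborhood function in (3) is not legible here, so a Gaussian kernel
is assumed, and the names `sofm_step`, `grid` and `radius` are hypothetical.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

def sofm_step(weights, grid, x, alpha, radius):
    """One learning step on a map whose unit weight vectors are the rows of `weights`.

    grid   : (n_cells, 2) array of cell coordinates on the map
    x      : unit-length input vector I(t)
    alpha  : learning gain alpha(t)
    radius : extent of the neighborhood N_c(t) around the winner
    """
    act = weights @ x                              # eq. (1): dot-product activations
    c = int(np.argmax(act))                        # winning cell
    r = np.linalg.norm(grid - grid[c], axis=1)     # map distance to the winner
    # ASSUMPTION: Gaussian kernel inside the neighborhood, zero outside it.
    h = np.where(r <= radius, alpha * np.exp(-(r / radius) ** 2), 0.0)
    updated = weights + h[:, None] * x             # move neighbors toward the input
    updated /= np.linalg.norm(updated, axis=1, keepdims=True)  # eq. (2) renormalization
    return updated, c
```

Because every weight vector is renormalized after the update, the activations of eq. (1) stay within
[-1, 1], which is the property the cost definition of section IV relies on.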


The  question  now  is  how  to  compare  such  a  sequence  to  a  reference  sequence. 


III. Methods for sequence comparison

Since a sequence possesses an underlying structure (it is made of a symbol 'a' followed by a symbol
'b', etc.), one could consider the use of a syntactic method (i.e. a grammar). Production rules would be
extracted from the references, and test sequences would be recognized depending on whether they fit the
constructed grammar or not. However, several reasons prevent us from resorting to this technique:

• difficulty in extracting production rules. Extra rules should be added to the set of basic production
rules in order to handle missing or additional elements: this is a type of problem that syntactic
methods have always had trouble coping with.

• difficulty in dealing with variations inherent in the use of feature maps. This turns out to be a key
problem since patterns of a test sequence are often perceived as more or less pronounced
variations of those of reference sequences. If the variations are significant, the expected cells will
remain inactive: their neighbors will come out (according to the fact that the mapping produced
is said to be "topology-preserving", as shown in [4], which implies that neighboring cells possess
contiguous weight vectors in the weight space). Constructing a coherent grammar capable of
handling all possibilities is nearly impossible.

Another  approach  would  favor  the  use  of  other  neural  networks  with  sequence  processing  capabilities. 
Networks  like  that  of  Tank  &  Hopfield  [15],  the  Time-Delay  of  [10]  or  recurrent  architectures  (e.g.  [1]) 
could  be  considered.  Although  some  of  them  were  designed  to  process  distorted  sequences  (whether  by 
using temporal windows [15] or by learning [10]), the "variations problem" remains unsolved: these nets
expect  specific  input  patterns  or  features,  and  no  substitutes  are  allowed. 

We can finally draw ideas from the study of string comparison: how to optimally transform a test string
into a reference string by using simple editing operations (insertions/deletions/substitutions) associated with
costs. The general string comparison algorithm is derived from dynamic programming (DP) techniques
and it has been used for tackling a large number of problems (see [7] and [8] for a review of these
techniques and their applications). The same concept has been explored independently by Wagner &
Fischer (in [16]) when they addressed the issue of comparing strings typed on a keyboard with reference
keywords. They came up with an approach similar to DP that computes a distance between two strings
based on specific costs (depending on the probability of typing error, for example; in that regard, the cost
for confusing V and 'e' should obviously be much lower than that of confusing V and 'p'). The analogy
with the problem explained in the introduction is striking: the feature map may be seen as a keyboard where
a sequence of keys making up a word corresponds to a sequence of activated cells. Just as two neighbouring
keys can be substituted with high probability, two neighboring cells represent (by their
weight vectors) similar shapes or patterns: therefore the cost for their substitution could be a function of
their activations before competition.

This  analogy  naturally  leads  to  the  use  of  the  Wagner  &  Fischer  algorithm  as  a  means  of  measuring  the 
dissimilarity  between  two  sequences  generated  by  a  feature  map. 


IV. Wagner & Fischer algorithm

Here is a brief summary of the Wagner & Fischer algorithm:

Let A and B be two strings.

$A = \{a_i\}$ ; $i = 1 \ldots \mathrm{length}(A)$

$B = \{b_j\}$ ; $j = 1 \ldots \mathrm{length}(B)$

In order to transform A into B, three operations are allowed:

• substitution $a_i \to b_j$; cost = $\gamma(a_i \to b_j)$

• insertion $\lambda \to b_j$; cost = $\gamma(\lambda \to b_j)$

• deletion $a_i \to \lambda$; cost = $\gamma(a_i \to \lambda)$

where $\lambda$ is the null string.

The distance between strings A and B is the accumulation of small distances D(i, j) given by:

$$D(i,j) = \min \begin{cases} D(i-1,\,j-1) + \gamma(a_i \to b_j) \\ D(i-1,\,j) + \gamma(a_i \to \lambda) \\ D(i,\,j-1) + \gamma(\lambda \to b_j) \end{cases} \qquad (4)$$

The cost function $\gamma$ must obey fundamental properties so that the distance can be considered a metric [7]:

1. $\gamma(a_i \to b_j) \ge 0$ : nonnegative costs

2. $\gamma(a_i \to a_i) = 0$ : cost = 0 for two identical symbols

3. $\gamma(a_i \to b_j) = \gamma(b_j \to a_i)$ : symmetry

4. $\gamma(a_i \to b_j) \le \gamma(a_i \to \lambda) + \gamma(\lambda \to b_j)$ : triangular inequality condition

The determination of costs is a key issue. We set costs $\gamma(\lambda \to b_j)$ and $\gamma(a_i \to \lambda)$ to 1: they are considered
as editing costs resulting from the deletion or the insertion of an element with respect to one of the
sequences under analysis. As for the substitution cost $\gamma(a_i \to b_j)$, it should remain low if the element
corresponding to $a_i$ activates a cell near the one activated by the element corresponding to $b_j$,
and become high otherwise. Intuitively, it is reasonable to suggest:

$$\gamma(a_i \to b_j) = \mathrm{Act}_c - \mathrm{Act}_{ref} \qquad (5)$$

where $\mathrm{Act}_c$ = activation of the winning cell (symbol $a_i$) after presentation of a test vector

$\mathrm{Act}_{ref}$ = activation of the cell (symbol $b_j$) that should have won according to reference B

Moreover, knowing that $\mathrm{Act}_c \ge \mathrm{Act}_{ref}$ since c is the winning cell, it follows that $0 \le \gamma(a_i \to b_j) \le 2$.

If the winning cell is the same as the expected cell, then the corresponding patterns from the two strings
match and can be substituted at no cost; on the other hand, if the corresponding patterns do
not sustain comparison, the activation of the expected cell will be much lower than that of the cell c, thus
yielding a high substitution cost that is likely to favour a deletion or an insertion operation. It is important
to point out that this cost is bounded since the activation of the winning cell c is less than or equal to 1 (with
normalized vectors $I(t)$ and $W(t)$). This fact ensures that the property of triangular inequality is respected.
In addition, one can easily notice that all properties stated in section IV are respected, except that of
symmetry: given that comparisons are made between test sequences and known references, only the test
vectors are supposed available (it is not necessary to keep the feature vectors that produced the reference
sequences). As a consequence, a comparison between a reference sequence and a test sequence cannot
be performed. In any case, it is mentioned in [7] that an asymmetric distance is accepted in situations
where an unknown sequence is compared against template sequences (in speech recognition, for example).
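
As an illustration of how the recurrence and the activation-based substitution cost fit together, here
is a minimal sketch. It is not the authors' implementation: the function names, and the representation
of the map response as one dictionary of activations per test vector, are assumptions made for the
example.

```python
def edit_distance(test, ref, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Cost of transforming the sequence `test` into the sequence `ref`.

    sub_cost(i, j) returns the cost of substituting test[i] by ref[j].
    Insertions and deletions have unit cost by default, as in the text.
    """
    n, m = len(test), len(ref)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # delete every remaining test element
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):          # insert every remaining reference element
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub_cost(i - 1, j - 1),  # substitution
                D[i - 1][j] + del_cost,                    # deletion
                D[i][j - 1] + ins_cost,                    # insertion
            )
    return D[n][m]


def activation_cost(activations, winners, ref):
    """Substitution cost Act_c - Act_ref, computed from the test vectors only.

    activations[i] maps each cell label to its activation for test vector i;
    winners[i] is the label of the cell that won for that vector.
    """
    def sub_cost(i, j):
        return activations[i][winners[i]] - activations[i][ref[j]]
    return sub_cost
```

With hypothetical activations in which the test sequence "a e f" contains one element more than the
reference "a f", the cheapest path deletes the extra element, so the distance equals the unit deletion
cost. Only the activations produced by the test vectors are needed, which is why the resulting distance
is asymmetric.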


V.  An  example 


In order to illustrate the idea, a basic experiment has been conducted with signatures made of simple
straight lines (figs. 1-2). Points joined by the lines are actual segmentation points. Feature vectors are
then constructed with specific measures obtained from the curves (depicted in fig. 3), thus carrying the
following information:

$$\text{Vector} = [\ \ldots,\ \sin\theta,\ \cos\theta\ ] \qquad (6)$$

Note that pairs of segments are overlapping: segment #1 joined to segment #2, segment #2 joined to
segment #3, etc. so that elements are chained together. Due to the selected coding, each segment is
described with respect to the prior segment by angle $\theta$. Since the features are comparable (they are
piecewise normalized so that the vector length remains constant), the vector can be transformed into a unit
vector, as required by the feature map learning algorithm. Learning begins with $\alpha(0) = 0.9$ and $N_c(0)$ =
90% of map surface (a 4x4 map). The training vectors are those extracted from references R1 and R2.
The map so obtained (sketched in figure 4) sets up an analysis framework specific to the genuine signer.
Note that in more advanced signature verification systems the vector (6) is more complete since the
varying curvature of the pen trace between segmentation points along with its dynamics is taken into
account [9].

Reference and test sequences generated the trajectories shown in fig. 5. A careful study of test signatures Ti
reveals that T1 is similar to R1 and R2, T2 counts an extra segment, and T3 is totally different from R1
or R2. We expect:

• a small distance between T1 and Rj

• a distance between T2 and Rj greater than 1 (the cost due to the deletion of the additional
segment)

• a large distance between T3 and Rj

Comparisons have been made using the Wagner & Fischer algorithm (described in section IV) and the
optimal matching paths appear in figs. 6 to 8. We see that T1 fully matches any reference; the additional
segment in T2 has been removed; deletion of elements in T3 and insertion of others are such that the
resulting signatures now look similar. Moreover, distance measures compiled in table 1 confirm our
estimations. This simple example shows that comparison of sequences generated by a feature map can
be done with a DP technique: in a classification context, an unknown sequence could be compared against
a set of prototype sequences and the smallest distance would indicate the class the input sequence
probably belongs to.


Distance Ti vs Rj     Reference #1     Reference #2
Test #1               0.0234           0.0280
Test #2               1.2494           1.2786
Test #3               4.3032           4.3032

Table 1. Distances between "signatures" as computed by the DP algorithm


VI. Discussion

The major constraint limiting the wide application of the algorithm is undoubtedly the requirement of
unit feature vectors, which is sometimes difficult to meet (particularly in our example where the feature
vector was a set of interrelated geometrical measurements). Another definition of activation for the
feature map cells would be possible provided that normalization of the overall map response is made in
order to guarantee a set of costs that is in agreement with the previously stated triangular inequality
principle. An alternate activation function for each cell (sigmoidal instead of the standard linear in recall
mode) could do the job, but the gain and the inflexion point of the sigmoid would become critically
important parameters.


Other  pertinent  adjustments  may  also  affect  the  choice  of  costs:  in  some  cases,  elements  of  a  sequence 
could  have  more  importance  than  others,  e.g.  in  signature  verification,  where  the  various  portions  of  a 
signature  may  exhibit  very  different  sizes,  and  thus  unequal  importance.  The  retained  cost  values  should 
be  weighted  so  that  the  importance  of  each  element  is  considered. 


VII. Conclusion

In  this  paper,  we  showed  that  a  string  comparison  technique  related  to  dynamic  programming  (namely  the 
Wagner  &  Fischer  algorithm)  was  capable  of  measuring  the  dissimilarity  between  two  sequences  generated 
by  a  SOFM  neural  network.  Costs  associated  to  editing  operations  are  chosen  according  to  the  map 
response. 

Abstract

Fuzzy ARTMAP is a supervised learning system which includes nonlinear dynamics
in the learning process. A modified learning rule which enables 'forgetting' of
insignificant information is introduced. A handwritten digit recognition task is
applied to evaluate the performance of the network in a real, noisy environment. Two
different preprocessing algorithms, based on either positional or directional
information extracted from the image, are used. The latter algorithm is more successful. The
fuzzy ARTMAP network is compared to the K Nearest-Neighbor (KNN) algorithm.
Although the modified learning rule improves the performance of fuzzy ARTMAP,
KNN still performs somewhat better. However, the amount of memory and the length
of recognition time required by fuzzy ARTMAP are significantly smaller.


ARTMAP,  KNN,  and  the  ZIP  Code  Database 

ARTMAP [Carpenter, Grossberg and Reynolds, 1991] is a neural network architecture that
performs incremental supervised learning of recognition categories and multidimensional
maps in response to binary input vectors presented in an arbitrary order. Fuzzy ARTMAP
[Carpenter et al., 1992] incorporates fuzzy logic [Zadeh, 1965] to classify inputs by a fuzzy
set of features indicating the extent to which each feature is present. To evaluate the
performance of the system on a difficult problem, the handwritten digit recognition task was
proposed. Digits were obtained from the United States Postal Office of Advanced Technology
Handwritten ZIP Code Database (1987), which consists of five-digit ZIP codes. Separation
of digits, beyond the scope of this project, was performed manually.

The K Nearest-Neighbor classifier has been examined on handwritten recognition tasks
[Lee,  1991].  Compared  to  a  backpropagation  network  which  uses  local  receptive  fields  and 
shared  weights,  and  to  radial  basis  function  networks,  it  provides  a  similarly  low  error  rate. 
However,  KNN  requires  a  very  large  amount  of  memory,  and  is  slow  in  classification.  The 
KNN  algorithm  chooses  a  winning  category  based  on  the  K  training  points  that  lie  nearest 
to  a  test  point.  It  is  used  for  comparison  with  fuzzy  ARTMAP  performance. 
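
The KNN decision rule can be sketched in a few lines. This is a generic illustration, not the
implementation used in the simulations: the function name is hypothetical, a Euclidean distance is
assumed, and ties are broken in favor of the label encountered first among the nearest points.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Every training point is stored and compared against each test point, which accounts for the large
memory requirement and slow classification noted above.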

Noisy images are passed through several steps of preprocessing before being presented
to the recognition network. First, the background noise is removed from figures and
transformation of images into a scale-rotation invariant representation is performed. Then, two
types of preprocessing are applied to construct a one-dimensional input vector for the
classifier. The first method incorporates positional information while the second incorporates
directional information. Both fuzzy ARTMAP and KNN show slightly better performance
with the directional preprocessor than with the positional preprocessor. Thus directional
preprocessing is used to compare system performance. Simulations show that there is a
trade-off between the predictive accuracy and the number of nodes created by the
classifiers. The following sections describe the preprocessing steps, the classifiers, and simulation
results.

Figure 1: Examples of ZIP codes from United States Postal Office of Advanced Technology
Handwritten ZIP Code Database (1987).


ZIP  code  data  and  preprocessing  steps 

ZIP codes from the postal service data base use a great variety of sizes, styles, and
instruments. Both the training set and the test set contain numerous examples that are
ambiguous, extremely noisy, and can be misclassified by people. Gray scale images consist
of five digits, and some extraneous marks may also be present, such as pieces of letters from
the address label or underlining. Digits in a ZIP code may overlap and they are surrounded
by background noise. Some examples are shown in Figure 1.

Most of the background noise has lower intensity than the digits and was removed
by thresholding. The level of the threshold was automatically defined by the analysis of
histograms of images, computed as the average image intensity plus an empirical value
that defines the range of noise intensity fluctuations. After thresholding, small spots of high
intensity, several pixels in width, that do not belong to digits may remain in the image. A
median filter removed these points. This filter replaces the pixel intensity value with the
middle value over its neighbors, removing isolated fluctuations of intensity in small areas.
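
The two noise-removal steps can be sketched as follows. This is an illustration only: the function
names, the 3 x 3 window, and the edge handling are assumptions, and `offset` stands for the empirical
value whose magnitude is not given in the text.

```python
import numpy as np

def remove_background(image, offset):
    """Zero every pixel whose intensity does not exceed mean intensity + offset."""
    threshold = image.mean() + offset
    return np.where(image > threshold, image, 0)

def median_filter3(image):
    """Replace each pixel by the median of its 3x3 neighborhood (edges replicated)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    # Stack the nine shifted copies of the image, one per window position.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0)
```

With a window of this size an isolated bright pixel is always outvoted by its neighbors, whereas the
interior of a stroke several pixels wide is preserved.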

Digits in ZIP codes have different inclinations (Figure 1). To allow separation and to
remove rotation uncertainty, digits were transformed into an invariant vertical position.
The main direction was defined as the one with the highest activity obtained during
convolution of the image with orientation selective filters of different orientations. The affine
transformation with the center in the left upper corner of the image transforms the image
to make the main direction vertical. This positions digits in an upright position (Figure
2). After manual separation, a linear transformation fit each digit to the 16 x 16 box. To
transform the 2-D image into a 1-D classifier input, coarse coding [Seibert, Waxman, 1990] and
directional preprocessing were used. In coarse coding, a featural component is the convolution of
pixel intensities in the image with a large overlapping Gaussian-weighted receptive field. The
receptive field was truncated at a diameter of 3-4 pixels, and fields overlapped by half.

Figure 2: Digits transformed to a vertical position, with background noise removed.

Orientation selective filters in the form of difference-of-Gaussians were used in directional
preprocessing. The image was divided into 16 cells of 4 x 4 pixels each. The image in each
cell was convolved with filters of several orientations; the number of orientations was fixed
at 6 throughout. Then, for each cell and each filter, the maximum activity over the resulting
image was determined and taken as one feature in the input vector to the classifier.
This procedure provides a network with information about directional preferences of
digits in different locations. Input vectors obtained with the positional and the directional
preprocessing contained 49 features and 64 features respectively.


Fuzzy  ARTMAP  and  modified  learning  rule 

Fuzzy ARTMAP (Figure 3) includes a pair of Fuzzy ART modules ($ART_a$ and $ART_b$)
[Carpenter, Grossberg and Rosen, 1991] linked together via an inter-ART associative memory
$F^{ab}$ that is called a map field. During supervised learning, $ART_a$ receives a stream
$\{a^{(p)}\}$ of input patterns and $ART_b$ receives a stream $\{b^{(p)}\}$ of patterns, where $b^{(p)}$ is the correct
prediction given $a^{(p)}$. These modules are linked by an associative learning network and an
internal controller that ensures autonomous system operation in real time. The controller
is designed to create the minimal number of $ART_a$ recognition categories needed to meet
accuracy criteria.

Vigilance parameter $\rho_a$ calibrates the minimum confidence that $ART_a$ must have in a
recognition category, or hypothesis, activated by an input $a^{(p)}$ in order for $ART_a$ to accept
that category, rather than search for a better one through an automatically controlled
process of hypothesis testing. Lower values of $\rho_a$ enable larger categories to form. These
lower $\rho_a$ values lead to a broader generalization and a higher degree of code compression. A
predictive failure at $ART_b$ increases $\rho_a$ by the minimum amount needed to trigger hypothesis
testing at $ART_a$, using a mechanism called match tracking. Match tracking sacrifices the
minimum amount of generalization necessary to correct the predictive error. Hypothesis
testing leads to the selection of a new $ART_a$ category, which focuses attention on a new
cluster of input features that is better able to predict $b^{(p)}$. Match tracking allows a
single ARTMAP system to learn a different prediction for a rare event than for a cloud of
similar frequent events in which it is embedded.

Figure 3: Fuzzy ARTMAP.

An  ARTMAP  voting  strategy  is  based  on  the  observation  that  fast  learning  typically 
leads  to  different  adaptive  weights  and  recognition  categories  for  different  orderings  of  a 
given training set, even when the predictive accuracy of all simulations is similar. The
different  internal  category  structures  cause  the  set  of  test  set  items  where  errors  occur  to 
vary  from  one  simulation  to  the  next.  The  voting  strategy  uses  an  ARTMAP  system  that 
is  trained  several  times  on  one  input  set  with  different  orderings.  The  final  prediction  for  a 
given  test  set  item  is  the  one  made  by  the  largest  number  of  simulations. 
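
The voting step itself reduces to a majority count per test item, as in the following sketch. The
function name and the list-of-lists layout are assumptions made for the example; ties are broken in
favor of the label predicted by the earliest run.

```python
from collections import Counter

def vote(predictions_per_run):
    """Combine per-run predictions by majority.

    predictions_per_run[r][n] is the label predicted for test item n by the
    system trained on the r-th ordering of the training set.
    """
    return [Counter(item_predictions).most_common(1)[0][0]
            for item_predictions in zip(*predictions_per_run)]
```

Since the runs tend to err on different test items, an item misclassified by one run is usually
corrected by the others.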

Once an $ART_a$ category (J) is chosen whose prediction of the actual $ART_b$ category is
correct, match tracking is disengaged, and resonance occurs at $ART_a$. During resonance,
learning occurs at $ART_a$ according to the equation

$$w_J^{(new)} = \beta\,\bigl(I \wedge w_J^{(old)}\bigr) + (1 - \beta)\,w_J^{(old)} \qquad (1)$$

where fast learning corresponds to setting $\beta = 1$.

The categories created during learning may be represented geometrically as multidimensional
"boxes" in the space of input vectors [Carpenter et al., 1992]. The important feature
of learning rule (1) is that it allows only increases in the size of these boxes during learning.
However, such a rule results in a great dependency on early training vectors in how
categories are formed. A modified learning rule introduced here reduces this dependency. In
the process of learning, the winning category (J) is now allowed to both expand and shrink:

$$w_J^{(new)} = (1 - \beta)\,w_J^{(old)} + \beta\,I \qquad (2)$$

where $\beta$ is relatively small (around 0.2). Simulations have shown that the recognition ability
of the network with learning rule (2) is slightly increased at the expense of reduced code
compression.

preprocessing   # simulations   avg. # of categories   % correct test set predictions
                                                        without voting    with voting
directional     5               484                    89.7 - 91.6       92.5 - 93.0
positional      5               509                    89.5 - 90.0       91.9 - 92.2

Table 1: Fuzzy ARTMAP performance with directional and positional preprocessing. The
system was trained on 9720 training exemplars and tested on the remaining 2426 exemplars.
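
The difference between the two learning rules can be seen in a short sketch. It assumes that rule (1)
is the standard fuzzy ART update, in which the fuzzy AND is the component-wise minimum; the function
names are hypothetical.

```python
def fuzzy_and(x, y):
    """Component-wise minimum, the fuzzy AND operator."""
    return [min(a, b) for a, b in zip(x, y)]

def learn_standard(w, I, beta=1.0):
    """Rule (1): w_new = beta * (I AND w) + (1 - beta) * w."""
    return [beta * m + (1.0 - beta) * wi
            for m, wi in zip(fuzzy_and(I, w), w)]

def learn_modified(w, I, beta=0.2):
    """Rule (2): w_new = (1 - beta) * w + beta * I."""
    return [(1.0 - beta) * wi + beta * xi for wi, xi in zip(w, I)]
```

Under rule (1) no weight component can increase, so a category box can only grow. Under rule (2) a
component moves toward the input in either direction, which lets the box shrink as well and weakens
the influence of early training vectors.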


Parameters  and  results 

Preliminary simulations were used to choose parameters for recognition methods. In the
KNN algorithm a Euclidean ($L_2$) metric was used, and the number of neighbors (K) was
fixed at 5.

ART dynamics are determined by a choice parameter $\alpha > 0$, a learning rate parameter
$\beta \in [0, 1]$, and a baseline vigilance parameter $\bar{\rho}_a \in [0, 1]$. All simulations used $\bar{\rho}_a = 0$ and
choice parameter $\alpha = 1.0$. Fast learning with $\beta = 1$ was used in all simulations with
fuzzy ARTMAP as a classifier; for ARTMAP with the learning rule (2), slow learning
with $\beta = 0.2$ was employed. All inputs were normalized and complement coding
[Carpenter, Grossberg and Rosen, 1991] was used; the number of voters was equal to 7.
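
Complement coding, as defined for fuzzy ART, appends to each feature its complement, so that every
coded input has the same city-block norm. A minimal sketch (the function name is an assumption):

```python
def complement_code(a):
    """Map a feature vector a in [0, 1]^M to the 2M-dimensional vector (a, 1 - a)."""
    return list(a) + [1.0 - x for x in a]
```

For an input with M features the components of the coded vector always sum to M, which is the
normalization that the fuzzy ART match and choice computations rely on.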

The performance was measured on the test set of 2426 exemplars after the system was
trained in the off-line regime on 9720 training exemplars presented to the system in
random order. Most fuzzy ARTMAP learning occurs during the first epoch, and the system
achieves 100% correct prediction on the training set in about 20 epochs, while more than
99% is achieved in 10-15 epochs. The results of fuzzy ARTMAP performance on the
test set are shown in Table 1 for both types of preprocessing. The network performed
better with the directional preprocessing: recognition and the compression rate are slightly
higher. The same relation was obtained with the K Nearest-Neighbor classifier: with the
coarse coding preprocessing 93.7% of test exemplars were correctly recognized, compared to
94.7% with the directional preprocessing.

Table 2 compares the performance of the KNN classifier, fuzzy ARTMAP, and fuzzy
ARTMAP with modified learning rule (2). All the results were obtained with the directional
preprocessing by using the same training and testing sets. Trained on 9720 inputs, KNN
correctly recognized 94.7% of the test set compared to 92.8% achieved by fuzzy ARTMAP
and 93.8% achieved by fuzzy ARTMAP with the modified learning rule. However, fuzzy
ARTMAP compressed memory by a factor of 20, or by a factor of 8 with the modified learning
rule, resulting in a comparable speed-up of test set recognition time. This comparison shows
a trade-off between the number of nodes in the network and the level of performance on the
test set.
                                             fuzzy ARTMAP   fuzzy ARTMAP +            KNN
                                                            modified learning rule

average % of correct test set predictions    92.8           93.8                      94.7

average number of committed nodes            484            1261                      9720

test set classification time (hours)         0.3            0.5                       5.1

Table  2:  Performance  for  three  classifiers  on  handwritten  letter  recognition  task.  The 
directional  preprocessing  was  used  in  all  three  cases.  The  training  (9720  inputs)  set  and  the 
test  set  (2462  inputs)  were  the  same  for  all  networks. 
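
The compression and speed-up factors quoted in the text can be re-derived from the figures in Table 2. The short Python sketch below does only that arithmetic; the variable names are illustrative.

```python
# Figures copied from Table 2.
training_exemplars = 9720   # KNN stores every training exemplar as a node
knn_hours = 5.1
committed_nodes = {"fuzzy ARTMAP": 484, "fuzzy ARTMAP + modified rule": 1261}
test_hours = {"fuzzy ARTMAP": 0.3, "fuzzy ARTMAP + modified rule": 0.5}

# Memory compression relative to KNN, and speed-up of test set classification.
compression = {name: training_exemplars / n for name, n in committed_nodes.items()}
speedup = {name: knn_hours / h for name, h in test_hours.items()}

for name in committed_nodes:
    print(f"{name}: compression {compression[name]:.1f}x, speed-up {speedup[name]:.1f}x")
```

The ratios come out at roughly 20 and 8 for memory and roughly 17 and 10 for classification time, in line with the "comparable speed-up" stated above.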
ABSTRACT

In this paper an adaptive feature extraction nearest neighbor classifier "AFNN" is proposed.

The AFNN consists of a linear adaptive feature extractor "AFE", mapping the original
I-dimensional input to a lower L-dimensional feature space, which is applied to an adaptive
nearest neighbor classifier "ANNC". Both the AFE and ANNC parameters are learned
simultaneously by maximizing the mutual information of the overall classifier. A stochastic
complexity criterion is developed to estimate the optimal number of features and prototypes required for a
certain task. Results of two experiments show the advantages of using the AFNN framework.

1.  Introduction 

Distance-based classifiers can be very demanding in computation and storage, depending on two
factors: the number of prototypes and the input dimensionality. The reduced Parzen classifier [1], the learning
vector quantization "LVQ" [2], and the condensed versions of the nearest-neighbor classifier [3] attempt to use a small
number of prototypes while retaining the classification optimality. Although these techniques may lead to a
significant reduction in complexity over traditional methods such as the nearest neighbor classifier "NNC" and the Parzen
window classifier [1], greater reduction can still be obtained by reducing the input dimensionality. Not only does
a large input dimensionality add complexity, it also deteriorates the performance, especially for small data sets [4]. This
drawback may be overcome by extracting a small set of features from the original attributes, without losing the
discriminative information in the data. Feature extraction techniques vary according to their main objective, which
may be either compression or classification. The Karhunen-Loeve transform "KLT" is optimal in data compression
applications such as transform coding, where uncorrelated features result with only a few of them containing most of
the probability information required to reconstruct the data [1]. The KLT, however, may be far from optimal for
classification, since it does not address the issue of discrimination between classes. This fact is demonstrated here,
as well as in [1].

In this paper an adaptive feature extraction nearest neighbor classifier "AFNN" is proposed, aiming to overcome
the above drawbacks. It is composed of two parts. The first part is an adaptive feature extractor "AFE", with a linear
singular transform from the original I dimensions to a smaller L dimensions. The second part is an adaptive nearest
neighbor classifier "ANNC" [5,6], which operates on the L-dimensional space, and has a codebook of $K_j$ prototypes
for the $j$-th class. Both the transform weights of the AFE and the prototype parameters of the ANNC are learned
together, starting from random values, by maximizing the mutual information (MMI) of the overall classifier. The MMI
learning is used since it directly minimizes an upper bound of the classifier's probability of error [7]. For the AFNN
architecture to be defined we have to estimate two unknowns, namely: the optimal number of extracted features, and
the optimal number of prototypes. Following the Bayesian model selection framework [8], we derive an expression
for the stochastic complexity criterion for classification "SCC" in the AFNN classifier. This allows us to compare
different combinations of the number of features and prototypes, and select the best one: the one with the least SCC. The
AFNN is tested with two classification experiments: a 2-dimensional synthetic problem, which demonstrates the
drawbacks of the KLT in classification, and a 16-dimensional printed letter recognition problem between the letters I
and J.
2. Adaptive Feature Extraction Nearest Neighbor Classifier: AFNN

The AFNN is a hybrid architecture which consists of two parts: the adaptive feature extractor "AFE", and the
adaptive nearest neighbor classifier "ANNC", as shown in figure(1) for a 2-class case.


Fig(1): The AFNN

In the AFE, the I-dimensional input vector x is transformed by a linear singular transform to the L-dimensional
hidden vector y. This hidden vector is the reduced dimensionality feature vector, which is used as the input pattern to
the ANNC. The linear transform w has (I L) weights, where the weight connecting the $x_i$ input to the $y_l$ hidden node
or feature is denoted by $w_{li}$. When the $n$-th input pattern is applied, the $l$-th feature is given by:

$$ y_l(n) = \sum_{i=1}^{I} w_{li} \, x_i(n) \qquad (1) $$

The ANNC, as discussed in [5,6], is a nearest neighbor classifier with a small number of prototypes per class, which
adapt their locations to maximize the mutual information criterion for a given set of training data. For a two-class
case, let the codebook of the ANNC contain $K_1$ and $K_2$ prototypes for class 1 and 2 respectively. Each prototype is
an L-dimensional vector with each component denoted by $m_{lj}$, where $j$ is the class index, and $l$ denotes the
component in the vector. When an input pattern x is applied, a feature pattern y results, and the nearest prototypes to
that pattern (in the Euclidean sense) are $m_j$, where $j$ is 1 and 2 for both classes respectively, which may also be called
the winner prototypes. A probabilistic approximation of the ANNC, presented in [6], assumes that each prototype is
the center (mean vector) of a Gaussian window, and that for a given pattern y, each class probability density function
is approximated by the Gaussian centered at the winner prototype. For simplicity, we assume that all
Gaussians are radially symmetric with equal standard deviation $\sigma$. In that case, the winner prototype with the least
distance to the input pattern corresponds to the Gaussian with the highest value. Following this formulation, the $j$-th class
probability density approximation for the feature vector y is given by:

$$ P(y \mid \theta_j, M_j, C_j) = (2\pi\sigma^2)^{-L/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{l=1}^{L} (y_l - m_{lj})^2 \right) \qquad (2) $$

where $j$ is 1 or 2, and $m_{lj}$ is the $l$-th component of the winner prototype in the $j$-th class. Now let us consider how the
AFNN can be used to perform optimal classification. The linear transform maps the data of the $j$-th class to $K_j$
clusters in the feature space, where $K_j$ is the number of prototypes in that class codebook. The adaptation of the
transform weights and these cluster centers aims to make the clusters of opposite classes linearly separable, so
that the ANNC classifier can form optimal piece-wise-linear decision boundaries. The capability of the AFNN
depends on the number of features $L$ and the number of prototypes per class $K_j$, where there is always a minimal
combination of both which is sufficient for the classifier to be optimal. In order to learn the transform and the
prototype locations we maximize the mutual information for the AFNN classifier, employing the probabilistic
formulation presented above.
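
The forward pass described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the class posteriors are obtained by normalising the two Gaussian densities (which assumes equal class priors), and `sigma` is a free parameter.

```python
import math


def afe_transform(w, x):
    """Linear feature extraction: y_l = sum_i w[l][i] * x[i]."""
    return [sum(w_li * x_i for w_li, x_i in zip(row, x)) for row in w]


def winner(prototypes, y):
    """Prototype of one class codebook nearest to y in the Euclidean sense."""
    return min(prototypes, key=lambda m: math.dist(m, y))


def class_density(m, y, sigma):
    """Radially symmetric Gaussian centred at the winner prototype m."""
    dim = len(y)
    sq_dist = sum((y_l - m_l) ** 2 for y_l, m_l in zip(y, m))
    norm = (2 * math.pi * sigma ** 2) ** (-dim / 2)
    return norm * math.exp(-sq_dist / (2 * sigma ** 2))


def afnn_posteriors(w, codebooks, x, sigma=1.0):
    """Class posteriors from the winner-prototype densities (equal priors assumed)."""
    y = afe_transform(w, x)
    dens = [class_density(winner(cb, y), y, sigma) for cb in codebooks]
    total = sum(dens)
    return [d / total for d in dens]
```

For example, with a single extracted feature, `afnn_posteriors([[1.0, 0.0]], [[[0.0], [1.0]], [[5.0], [6.0]]], [0.5, 0.0])` assigns almost all of the probability to the first class, because the feature value 0.5 lies much closer to that class's prototypes.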
3. MMI Training for the AFNN

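For the 2-class case, the large sample approximation of the mutual information for the AFNN is given by:

$$ G_{MI} = \frac{1}{N} \left( \sum_{n=1}^{N_1} \log P(C_1 \mid \theta, y_n, M) + \sum_{n=1}^{N_2} \log P(C_2 \mid \theta, y_n, M) \right) \qquad (3) $$

where $\theta$ is the ANNC parameter vector, each sum runs over the training patterns of the corresponding class, and
$P(C_j \mid \theta, y_n)$ is the posterior class probability model for the $j$-th class,
given by:

$$ P(C_j \mid \theta, y_n) = \frac{P(y_n \mid \theta_j, M_j, C_j)}{P(y_n \mid \theta_1, M_1, C_1) + P(y_n \mid \theta_2, M_2, C_2)} \qquad (4) $$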


The MI defined in (3) is an implicit function of the original input x and the transform weights w. Since the feature
vector y is their linear combination, (3) becomes:

$$ G_{MI} = \frac{1}{N} \left( \sum_{n=1}^{N_1} \log P(C_1 \mid \theta, w, x_n, M) + \sum_{n=1}^{N_2} \log P(C_2 \mid \theta, w, x_n, M) \right) \qquad (5) $$


where the posterior class probability is defined in (4), but with y being substituted as a function of x and w, as given
in (1). The MI is maximized with respect to the weights and the prototype centers by a gradient-ascent algorithm.
For a given pattern $x_n$ from the $j$-th class, the MMI weight updating equation is given by:

$$ \Delta w_{li} = \mu_w \, \frac{P_k}{P_1 + P_2} \, \frac{(m_{lj} - m_{lk})}{\sigma^2} \, x_i(n) \qquad (6) $$


where $\mu_w$ is the learning gain, $P_k$ is the PDF of the opposite class, and $m_{lj}$ and $m_{lk}$ are the $l$-th components of the winner
prototype mean vectors for the $j$-th class and the opposite $k$-th class respectively. Now we turn our attention to the
learning equations for the mean vectors of the winning prototypes. For an input $x_n$ from the $j$-th class, there is one
winning mean vector per class, namely $m_j$ and $m_k$, from the correct and incorrect class codebooks respectively. The
MMI updating equations for these mean vector components are given by:


where $\mu_m$ is the learning gain. These learning equations push the mean vector either closer to or farther from the
feature vector, for the correct and incorrect class codebooks respectively. Note that both the weights and the
prototypes are updated after the whole epoch is presented (batch mode).
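
The direction of these gradient-ascent updates can be checked numerically. The sketch below is not taken from the paper: it assumes the radially symmetric Gaussian class densities and the equal-prior posterior described above, holds the winner prototypes fixed while differentiating, and uses illustrative function names. Under those assumptions, differentiating log P(C_j | x) gives a weight gradient proportional to P_k/(P_1+P_2) (m_lj - m_lk) x_i, and prototype gradients that pull the correct-class winner toward y and push the opposite-class winner away from it.

```python
import math


def forward(w, codebooks, x, sigma):
    """Features, winner prototypes, and unnormalised Gaussian densities per class."""
    y = [sum(a * b for a, b in zip(row, x)) for row in w]
    winners = [min(cb, key=lambda m: math.dist(m, y)) for cb in codebooks]
    # The common normalising constant cancels in the posterior, so it is omitted.
    dens = [math.exp(-sum((yl - ml) ** 2 for yl, ml in zip(y, m)) / (2 * sigma ** 2))
            for m in winners]
    return y, winners, dens


def log_posterior(w, codebooks, x, j, sigma):
    """log P(C_j | x) for a two-class problem with equal priors."""
    _, _, dens = forward(w, codebooks, x, sigma)
    return math.log(dens[j] / sum(dens))


def mmi_gradients(w, codebooks, x, j, sigma):
    """Analytic gradients of log P(C_j | x) for a pattern x of class j (0 or 1)."""
    k = 1 - j
    y, winners, dens = forward(w, codebooks, x, sigma)
    g = dens[k] / sum(dens) / sigma ** 2   # P_k / (P_1 + P_2), scaled by 1 / sigma^2
    m_j, m_k = winners[j], winners[k]
    grad_w = [[g * (m_j[l] - m_k[l]) * x_i for x_i in x] for l in range(len(y))]
    grad_mj = [g * (y[l] - m_j[l]) for l in range(len(y))]    # pulls m_j toward y
    grad_mk = [-g * (y[l] - m_k[l]) for l in range(len(y))]   # pushes m_k away from y
    return grad_w, grad_mj, grad_mk
```

In batch mode, these per-pattern gradients would be accumulated over the whole epoch and then scaled by the learning gains before the weights and prototypes are changed.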

It is well known that the more complex the model is, the more flexible it becomes, and hence it can form arbitrary
decision boundaries. However, there is always a certain model complexity level beyond which it starts to overfit the
given training data, resulting in poor generalization to new data. We propose a stochastic complexity criterion for
classification "SCC", which is derived from the Bayesian model selection framework, to find the optimal number of
features and prototypes in the AFNN.


4.  SCC  for  the  AFNN 

The stochastic complexity criterion for classification [6] is given by: $SCC = -\log P(D_c \mid M)$, where $M$ denotes
the model under consideration, $D_c$ denotes the classifier data (i.e., the training data patterns and their class labels),
and $P(D_c \mid M)$ is the evidence for the classification data given the model, which is also the data-model likelihood of
the classifier, and is given by:

$$ P(D_c \mid M) = \int P(\theta, w \mid M) \prod_{n=1}^{N} P(C_j \mid x_n, \theta, w, M) \, d\theta \, dw \qquad (8) $$

where $j$ is the class index of the data pattern $x_n$, and $P(\theta, w \mid M)$ is the prior probability distribution of the parameters
of the model, namely $\theta$ of the ANNC part and $w$ of the transform part. The SCC of the AFNN model,
which needs to be minimized, is approximated by [6]:
$$ SCC = -G_{MI}(\theta_{MMI}, w_{MMI}) + \frac{1}{2} \log \det I(w_{MMI}) - \frac{|W|}{2} \left[ \log(2\pi) - \log N \right] + DSCC_{ANNC} \qquad (9) $$


where $|W|$ is the number of weights, $N$ is the number of training patterns, and: (1) $G_{MI}(\theta_{MMI}, w_{MMI})$ is the mutual
information of the AFNN classifier at the MMI parameter estimates, (2) $I(w_{MMI})$ is the observed Fisher information
matrix for the weights, which is a symmetric matrix of dimension $|W|$. If we order the weights of the adaptive
transform network in a vector of dimension $|W|$, then the $mn$-th element of the Fisher matrix, for the AFNN
architecture, can be simplified to:


*"  N  *=i  Sw,- 


(10) 


where $Q = P_1/P_2$ for class-1 data, and its inverse for class-2 data. From (10), the second derivatives needed for the
Fisher matrix can be computed directly from the gradient information, which is already available from the learning
equations, hence simplifying the computational complexity of the SCC significantly, and, (3) $DSCC_{ANNC}$ is the
diagonal SCC approximation for the ANNC part of the classifier, which is given in [5,6]. Note that uniform prior
probabilities for all the parameters were assumed.

Many combinations of the number of features and prototypes for the AFNN are trained by the MMI, their SCC
values are computed, and the one with the least SCC is selected as the best candidate among the competing ones.


5.  Experimental  Results 

Two 2-class classification experiments were used to test the AFNN. Here, the results emphasize the ability of
the SCC to find the optimal AFNN model, and the comparison between these different models.


5.1. Experiment 1

This experiment is designed to demonstrate the inability of the KLT to perform useful dimensionality reduction
in the context of classification. Here we have two classes of data, with 100 patterns each, as shown in figure(2).

Fig(2): Experiment(1), Original Data Distribution        Fig(3): Experiment(1), KLT-Transformed Data


The first class has one cluster, while the second has two clusters. The two classes are totally separable, although
nonlinearly, and the optimal attainable classification performance is 100%. The KLT, which here is
a 2 by 2 matrix of entries [0.707, 0.707, 0.707, -0.707], is applied to this data, and the resulting transformed data is shown in figure(3).
From this figure we see clearly that if one tries to choose only one KLT-extracted feature, the two classes would be
grossly overlapped. More specifically, if the y1-axis feature is extracted, class-1 data and class-2 upper-cluster data
will be projected on the y1 axis on top of each other, and a very poor classification results. Similarly, if the y2-axis
feature is extracted, class-1 data and the right-most cluster of class-2 will be projected on the chosen axis on top of
each other, and an even poorer classification is obtained.
The AFNN was first applied with two extracted features (similar to the KLT), and two prototypes per class. The
resultant transform matrix has entries [0.7788, -0.326, -0.495, 0.197], which are totally different from the KLT
weights. The resulting transformed data is shown in figure(4), where AFNN(2,2) means 2 features and 2 prototypes
are used. It is clear that class-2 data is transformed to a single cluster, and that the two classes are linearly separable.


y1        y1: Single Feature Axis


This means that there exists a line onto which this transformed data can be projected while remaining
separable, i.e., only one extracted feature is sufficient. Now, we apply an AFNN with only one feature and two
prototypes per class. Indeed, the AFNN obtained an optimal 100% classification solution, with weights [0.843, -0.533],
where the data is projected along a line on which the classes are separate, as shown in figure(5), where each projected data
point is represented by an impulse. This experiment shows clearly that the KLT cannot in general be used for
classification purposes, since it is not designed to do so. Instead, we should use a classification-oriented feature
extraction transform, such as the AFNN.
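
One way to see the difference between the two transforms is to compare their determinants. The sketch below assumes the four reported entries fill each 2 by 2 matrix row by row and reads the third AFNN entry as -0.495; the determinant is the same if the entries are instead listed column by column. The variable names are illustrative.

```python
def det2(m):
    """Determinant of a 2 by 2 matrix given as two rows."""
    (a, b), (c, d) = m
    return a * d - b * c


klt = [[0.707, 0.707], [0.707, -0.707]]        # reported KLT entries
afnn_22 = [[0.7788, -0.326], [-0.495, 0.197]]  # reported AFNN(2,2) entries

print(f"KLT determinant:       {det2(klt):.4f}")
print(f"AFNN(2,2) determinant: {det2(afnn_22):.4f}")
```

The KLT matrix has a determinant of magnitude close to 1, as expected of an orthonormal transform that keeps both dimensions. The learned AFNN(2,2) matrix has a determinant close to zero, so it is nearly singular and maps the data almost onto a line, which is consistent with the observation that a single extracted feature suffices.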

5.2. Experiment 2: 2-Letter Problem

This experiment is a printed letter recognition task, where 16-dimensional patterns for the letters I and J are used.
These 16 dimensions are high-level extracted attributes as discussed in [10]; however, as the results show, a much smaller
number of adaptively extracted features is sufficient. In this problem, we also used 100 patterns per class for training
and 400 per class for testing, and the AFNN is compared to the MLP, the LVQ, the probabilistic neural network
"PNN", and the NNC.
Table 1: Experiment 2, Summary of Results


Classifier      AFNN(2,3)  MLP(10)  MLP(20)  MLP(40)  LVQ(3)  NNC   PNN

% Test          93         90.4     90.3     90.6     90.6    92.9  93

# Parameters    46         190      380      760      96      3200  3200

In this experiment, the AFNN(2,3) has the least SCC, and is thus chosen here for comparison with the other classifiers.
The transformed training data with the decision boundary for the AFNN(2,3) is shown in figure(6). This figure
shows how the data of opposite classes are transformed to piece-wise-linearly separable clusters in the feature space.
The conclusions of this experiment are: (1) The AFNN(2,3) has outperformed the other classifiers in both
performance and reduction of complexity, and, (2) Although the MLP, the NNC, and the PNN have performed well,
they are inferior to the AFNN in performance (MLP), or in complexity (NNC and PNN).

6. Summary and Conclusions

A novel hybrid classifier, the AFNN, is proposed, which is composed of two parts. The first part is a linear
transform which extracts a minimum number of features from the input attributes, while the second uses these
features in an ANNC-type classifier. An MMI learning algorithm is developed to estimate the weights of the
transform and the ANNC prototypes simultaneously. An SCC criterion is also developed to estimate the optimal
combination of features and prototypes, based on the MMI trained AFNN. The results show that
high-dimensionality problems can be mapped into a much smaller dimensionality, which results in a very compact ANNC
stage and very compact overall AFNN classifiers. They also show that, because of the optimal training used, the
AFNN performs as well as or better than the traditional NNC. The conclusions of this paper are: The KLT should not,
in general, be used for supervised classification purposes when dimensionality reduction is sought. It is argued
intuitively, and shown experimentally, that the KLT may lead to very poor classification results for certain
problems. The AFNN, on the other hand, is designed to find a set of features which are optimal from the classification
perspective, and thus will not suffer from the KLT problems. It was surprising to find that only very few extracted
features are sufficient for a classification performance as good as, or better than, that of a classical classifier such as the
NNC. If this trend is valid for many large-dimensionality classification problems, it may prove to be extremely
useful in many ways. Firstly, the overall learning time is small. Secondly, the overall size is minimal, since it uses
the minimal number of features and the minimal number of prototypes. Finally, the developed SCC criterion is
successful in estimating the best combination of the number of features and prototypes among many trained AFNN
architectures.
Abstract

A  recently  proposed  neural  network  architecture,  the  parallel  consensual  neural  network,  is  applied 
in  classification  of  data  from  multiple  data  sources.  The  parallel  consensual  neural  network  (PCNN) 
architecture  is  based  on  statistical  consensus  theory  and  involves  using  stage  neural  networks  with  either 
non-linearly  transformed  input  data  or  different  initializations  for  the  stage  networks.  When  non-linear 
transformations  are  applied,  the  input  data  are  transformed  several  times  and  the  different  transformed 
data  are  used  as  if  they  were  independent  inputs.  The  independent  inputs  are  classified  using  stage 
neural  networks  and  the  outputs  from  the  stage  networks  are  then  weighted  and  combined  to  make  a 
decision. Optimization methods are proposed to compute the weights for the stage networks. The given
experimental  results  show  the  superiority  of  the  optimization  approach  as  compared  to  conjugate-gradient 
backpropagation  in  classification  of  test  data. 


1  Introduction 

The  recent  resurgence  of  research  in  neural  networks  has  resulted  in  the  development  of  new  and  improved 
neural network models. These new models have been trained successfully to classify complex data. In
pattern  recognition  applications,  the  question  of  how  well  neural  network  models  perform  as  classifiers 
is  very  important.  In  previous  papers  [1],[2],  it  has  been  shown  that  neural  networks  compared  well  to 
statistical classification methods in classification of multisource remote sensing/geographic data and
very-high-dimensional data. The neural network models were superior to the statistical methods in terms of overall
classification  accuracy  of  training  data.  However,  statistical  methods  based  on  consensus  from  several  data 
sources  outperformed  the  neural  networks  in  terms  of  overall  classification  accuracy  of  test  data.  Thus 
it would be very desirable to combine certain aspects of the statistical consensus theory approaches and
the  neural  network  models.  However,  it  is  very  difficult  to  implement  statistics  in  neural  networks.  In  [3] 
parallel  consensual  neural  networks  (PCNNs)  were  proposed  and  implemented  as  stage-wise  neural  network 
algorithms.  The  network  models  in  [3]  do  not  use  prior  statistical  information  but  are  somewhat  analogous 
to  the  statistical  consensus  theory  approaches.  In  this  paper  the  methods  proposed  in  [3]  are  extended  to 
include  optimal  weights  for  the  stage  networks.  The  paper  begins  with  a  short  overview  of  consensus  theory 
followed by a discussion of the PCNNs. Finally, experimental results are given.

*This research is supported in part by the Icelandic Council of Science, the National Aeronautics and Space Administration
Contract No. NAGW-925 and the Research Fund of the University of Iceland
2 Consensus Theory

Consensus  theory  [4], [5]  is  a  well-established  research  field  involving  procedures  for  combining  single  prob¬ 
ability  distributions  to  summarize  estimates  from  multiple  data  sources  with  the  assumption  that  the  data 
sources  are  Bayesian.  In  most  consensus  theoretic  methods,  the  data  from  each  source  are  at  first  classified 
into  a  number  of  source-specific  data  classes  [1].  The  information  from  the  sources  is  then  aggregated  by  a 
global  membership  function  and  the  data  are  classified  according  to  the  usual  maximum  selection  rule  into 
a  number  of  user-specified  information  classes.  The  combination  formula  obtained  in  consensus  theory  is 
called  a  consensus  rule.  Several  consensus  rules  have  been  proposed.  Probably  the  most  commonly  used 
consensus rule is the linear opinion pool, which has the following form for the information class $\omega_j$ if $n$ data
sources are used:

$$ C_j(Z) = \sum_{i=1}^{n} \alpha_i \, p(\omega_j \mid z_i) \qquad (1) $$

where $Z = [z_1, \ldots, z_n]$ is a pixel, $p(\omega_j \mid z_i)$ is a source-specific posterior probability and the $\alpha_i$'s ($i = 1, \ldots, n$) are
source-specific  weights  which  control  the  relative  influence  of  the  data  sources.  The  weights  are  associated 
with  the  sources  in  the  global  membership  function  to  express  quantitatively  our  confidence  in  each  source 
[4].  The  linear  opinion  pool  is  simple  but  has  several  shortcomings,  e.g.,  it  is  not  externally  Bayesian  since  it 
is  not  derived  from  class-conditional  probabilities  using  Bayes’  rule.  Another  consensus  rule  which  overcomes 
the  shortcomings  associated  with  the  linear  opinion  pool  is  the  logarithmic  opinion  pool: 

$$ L_j(Z) = \prod_{i=1}^{n} p(\omega_j \mid z_i)^{\alpha_i} \qquad (2) $$

The  logarithmic  opinion  pool  has  performed  well  in  classification  of  data  from  multiple  sources  [4]. 
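
The two consensus rules can be sketched in Python as follows. This is a minimal illustration that assumes the usual forms of the two pools, a weighted sum and a weighted product of the source-specific posteriors; the function names and the list-of-lists data layout are assumptions made for the example.

```python
import math


def linear_opinion_pool(posteriors, weights):
    """Weighted sum over sources: one score per information class."""
    n_classes = len(posteriors[0])
    return [sum(a * p[j] for a, p in zip(weights, posteriors))
            for j in range(n_classes)]


def log_opinion_pool(posteriors, weights):
    """Weighted product over sources: one score per information class."""
    n_classes = len(posteriors[0])
    return [math.prod(p[j] ** a for a, p in zip(weights, posteriors))
            for j in range(n_classes)]


def decide(scores):
    """Maximum selection rule: index of the class with the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Here `posteriors[i][j]` holds the posterior of class j reported by source i, and `weights[i]` is the weight of source i. In the product form, a single source that assigns a class a probability of zero vetoes that class, whereas in the sum form it only lowers the score.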

It  is  desirable  to  implement  consensus  theoretic  approaches  in  neural  networks  since  consensus 
theory  has  the  goal  of  combining  several  opinions  and  a  collection  of  different  neural  networks  should  be 
more  accurate  than  a  single  network  in  classification.  It  is  important  to  note  that  neural  networks  have  been 
shown to approximate class-conditional probabilities, $p(\omega_j \mid z_i)$, at the output in the mean square sense [6].
Using  this  property  of  neural  networks  it  becomes  possible  to  implement  consensus  theory  in  the  networks. 

3  Neural  Networks  with  Parallel  Stages 

A  block  diagram  of  the  parallel  consensual  neural  network  (PCNN)  architecture  is  shown  in  Figure 
1. Each stage neural network (SNN) has the same number of output neurons as the number of information
classes  and  is  trained  for  a  fixed  number  of  iterations  or  until  the  training  procedure  converges.  When  the 
training  of  the  first  stage  has  finished,  the  classification  error  is  computed.  Then  another  stage  is  created. 
The  input  data  to  the  second  stage  are  obtained  by  non-linearly  transforming  (NLT)  the  original  input 
vectors.  That  stage  is  trained  in  a  fashion  similar  to  the  first  stage.  When  the  training  of  the  second  stage 
has finished, the consensus for the SNNs is computed. The consensus is obtained by taking class-specific
weighted averages of the output responses of the SNNs using source-specific weights [4], similar to the ones in
equations  (1)  and  (2).  Error  detection  is  then  performed  and  the  consensual  classification  error  is  computed. 
In  neural  networks  it  is  very  important  to  find  the  "best"  representation  of  input  data  and  the  consensual 
method  attempts  to  average  over  the  results  from  several  input  representations  or  different  initializations  for 
the  stages.  Also,  in  the  consensual  neural  networks,  classification  of  test  data  can  be  done  in  parallel  with  all 
the  stages  receiving  data  simultaneously,  which  makes  this  method  attractive  for  implementation  on  parallel 
machines. 

The  PCNN  is  self-organizing  in  the  following  sense:  If  the  consensual  classification  error  is  lower  than 
the  classification  error  for  the  first  stage,  another  stage  is  created  and  trained  in  a  way  similar  to  the  second 
stage, but with another non-linear transformation of the input data or another initialization of the stage
neural  network.  Stages  are  added  in  the  consensual  neural  network  as  long  as  the  consensual  classification 
error  decreases  or  a  tolerance  limit  is  reached.  If  the  consensual  classification  error  is  not  decreasing  or  is 
lower  than  the  tolerance  limit,  the  training  is  stopped.  Using  this  architecture  it  can  be  guaranteed  that  the 
PCNNs should do no worse than single stage networks, at least in training. To be able to guarantee such
Figure 1: Parallel consensual neural network architecture

performance in classification of test data, cross-validation methods can be used. Also, it has been shown [7]
that if all the networks in a collection of neural networks arrive at the correct classification with a certain
likelihood $1 - p$ and the networks make independent errors, the chance of seeing exactly $k$ errors among $N$
copies of the network is:

$$ \binom{N}{k} p^k (1-p)^{N-k} $$

which gives the following likelihood of a sum of network outputs being in error:

$$ \sum_{k > N/2} \binom{N}{k} p^k (1-p)^{N-k} $$

which is monotonically decreasing in $N$ if $p < 1/2$. This implies that using a collection of networks reduces
the expected classification error if the networks have equal weights and make independent errors. It has also
been shown [8] that the standard deviation of the classification of a portfolio of neural networks (such as the
PCNN) decreases as the number of stage networks increases.
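
The majority-vote error can be evaluated directly from the binomial expression. The Python sketch below assumes independent errors with a common per-network error rate p, as in the argument above; the function names are illustrative. Odd values of N are used in the loop so that a tie between correct and incorrect votes cannot occur.

```python
from math import comb


def prob_k_errors(n, k, p):
    """Probability that exactly k of n independent networks are in error."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def majority_error(n, p):
    """Probability that more than half of the n networks are in error."""
    return sum(prob_k_errors(n, k, p) for k in range(n // 2 + 1, n + 1))


for n in (1, 3, 5, 7, 9):
    print(n, round(majority_error(n, 0.3), 4))
```

With p = 0.3, three networks give a majority error of 0.216 and five give about 0.163, both below the single-network error of 0.3.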

In  [3]  two  versions  of  the  PCNN  were  proposed.  Both  PCNNs  combine  the  information  from  separate 
inputs and can be considered neural network implementations of the consensus rules in equations (1) and
(2).  Here  we  concentrate  on  the  PCNNS,  the  consensual  neural  network  version  of  the  linear  opinion  pool 
which  will  be  referred  to  below  as  the  PCNN. 

Related  neural  network  architectures  to  the  PCNN  have  been  proposed  by  Hansen  and  Salamon  [7], 
Ersoy  and  Hong  [9],  Deng  and  Ersoy  [10],  Valafar  and  Ersoy  [11],  Alpaydin  [12],  and  Nilson  [13].  However, 
the  PCNN  architecture  is  different  from  all  of  these.  It  uses  non-linear  transformation  between  stages  and 
weights  the  output  from  all  the  SNNs. 

4  Optimal  Weights 

The  weight  selection  schemes  in  the  PCNN  should  reflect  the  goodness  of  the  separate  input  data,  i.e., 
relatively  high  weights  should  be  given  to  input  data  that  can  be  classified  with  good  accuracy.  There  are 
at least two possible weight selection schemes. The first one is to select the weights such that they weight
the individual stages but not the classes within the stages. This scheme is shown in Figure 1. In this case
one possibility is to use equal weights for all the outputs of the SNNs and effectively take the average of the
outputs from the SNNs. Another possibility is to use reliability measures which rank the SNNs according
to their goodness. Examples of such reliability measures are stage-specific classification accuracy of training data,
overall separability, and equivocation [1].

Figure 2: PCNN with weighted individual stages

The  second  scheme  is  to  choose  the  weights  such  that  they  not  only  weight  the  individual  stages 
but  also  the  classes  within  the  stages.  This  scheme  is  depicted  in  Figure  2.  In  the  case  of  the  PCNN  the 
combined  output  response  Y  can  be  written  in  a  matrix  form  as 

Y  =  XW 

where  X  is  a  matrix  containing  the  output  of  all  the  SNNs  and  W  contains  all  the  weights.  Assuming  that 
X  has  full  column  rank,  the  above  equation  can  be  solved  for  W  using  the  pseudo-inverse  of  X  or  a  simple 
delta  rule. 

Let’s  now  look  at  the  problem  of  choosing  the  weights  such  that  they  not  only  weight  the  individual 
stages  but  also  the  classes  within  the  stages.  In  order  to  find  the  optimal  weights  in  Figure  2  we  define 

$$X = [X_1 \; X_2 \; \cdots \; X_n], \qquad W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_n \end{bmatrix}$$

where $X_i$, $i = 1, \ldots, n$, are m x p matrices. Each row of $X_i$ represents an output vector of the stage
network SNN i. $W_i$, $i = 1, \ldots, n$, are p x p matrices representing the weights of each stage network SNN i.
If Y = D is the desired output of the whole network we have

XW  =  D. 

W is an unknown matrix, and its least-squares estimate $W_{opt}$ is sought to minimize the squared error

$$\|XW - D\|^2$$

This is a well-known problem from linear regression, signal processing and adaptive filters. The formula for
$W_{opt}$ uses the pseudo-inverse of X, i.e.,

$$W_{opt} = (X^T X)^{-1} X^T D$$

where $X^T$ is the transpose of X, and $(X^T X)^{-1} X^T$ is the pseudo-inverse of X if $X^T X$ is non-singular. In
the case that X is not of full column rank this solution becomes ill-conditioned. In that case one can use
dummy augmentation to make X a full column rank matrix in a higher dimensional space and then solve
the problem. There are at least two other suboptimal methods for solving the optimization problem above.
The rest of this section will be devoted to these methods.
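
For completeness, the pseudo-inverse formula follows from a standard least-squares argument, sketched here under the assumption that the norm is the Frobenius norm and that $X^T X$ is non-singular. Setting the gradient of the squared error with respect to W to zero yields the normal equations:

```latex
\begin{aligned}
E(W) &= \|XW - D\|^2 = \mathrm{tr}\!\left[(XW - D)^T (XW - D)\right] \\
\frac{\partial E}{\partial W} &= 2\, X^T (XW - D) = 0 \\
X^T X\, W &= X^T D \quad\Longrightarrow\quad W_{opt} = (X^T X)^{-1} X^T D
\end{aligned}
```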

The first method is to use sequential formulas to compute the optimal W. Let the ith row vector of
the matrix X be $x_i^T$ and the ith row of the matrix D be $d_i^T$; then W can be calculated iteratively using the
sequential formula

$$W_{i+1} = W_i + P_{i+1} x_{i+1} \left( d_{i+1}^T - x_{i+1}^T W_i \right)$$

$$P_{i+1} = P_i - \frac{P_i x_{i+1} x_{i+1}^T P_i}{1 + x_{i+1}^T P_i x_{i+1}}, \qquad i = 0, 1, \ldots, m-1$$


where $W_m$ is the least-squares estimate of $W_{opt}$. The initial conditions for the sequential formula are $W_0 = 0$
and $P_0 = \beta I$, where $\beta$ is a large positive number.
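
The sequential update can be illustrated with a short pure-Python sketch. It assumes the standard recursive least-squares form of the update (the P update followed by the W update), with matrices stored as lists of rows; the function name `rls_fit`, the value of beta and the toy data are illustrative assumptions, not taken from the paper.

```python
def rls_fit(X, D, beta=1e6):
    """Sequential least-squares estimate of W in XW = D.

    X: list of m rows (each a list of q floats); D: list of m rows (each p floats).
    Starts from W_0 = 0 and P_0 = beta * I and processes one row at a time.
    """
    q, p = len(X[0]), len(D[0])
    W = [[0.0] * p for _ in range(q)]
    P = [[beta if r == c else 0.0 for c in range(q)] for r in range(q)]
    for x, d in zip(X, D):
        Px = [sum(P[r][c] * x[c] for c in range(q)) for r in range(q)]  # P_i x
        denom = 1.0 + sum(x[r] * Px[r] for r in range(q))               # 1 + x^T P_i x
        # P_{i+1} = P_i - (P_i x x^T P_i) / denom; P is symmetric, so x^T P_i = (P_i x)^T
        P = [[P[r][c] - Px[r] * Px[c] / denom for c in range(q)] for r in range(q)]
        gain = [sum(P[r][c] * x[c] for c in range(q)) for r in range(q)]  # P_{i+1} x
        err = [d[j] - sum(x[r] * W[r][j] for r in range(q)) for j in range(p)]  # d^T - x^T W_i
        W = [[W[r][j] + gain[r] * err[j] for j in range(p)] for r in range(q)]
    return W

# Toy data generated from a known weight matrix [[2], [-1]] (hypothetical example).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
D = [[2.0], [-1.0], [1.0], [3.0]]
W = rls_fit(X, D)
print(W)
```

With a large beta the result approaches the batch least-squares solution; a finite beta acts like a small regularization term, so the match is approximate rather than exact.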

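The second method for solving the least-squares error problem is to choose the unitary W which minimizes
$\|D - XW\|$. We compute

$$\|D - XW\|^2 = \|D\|^2 - 2\langle D, XW \rangle + \|X\|^2$$

where $\langle D, XW \rangle = \mathrm{tr}(D W^T X^T)$ and tr returns the trace of its argument matrix. If

$$X^T D = V \Sigma U^T$$

is a singular value decomposition (SVD) of $X^T D$ then

$$\mathrm{tr}(D W^T X^T) = \mathrm{tr}(X^T D W^T) = \mathrm{tr}(V \Sigma U^T W^T) = \mathrm{tr}(\Sigma U^T W^T V) = \sum_i \sigma_i t_{ii}$$

where $\sigma_i$ are the singular values and $T = [t_{ij}] = U^T W^T V$ is a unitary matrix. This sum is maximized when all $t_{ii} = 1$, that is when

$$W_{opt} = V U^T.$$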


5 EXPERIMENTAL RESULTS


The PCNNs were used to classify a data set consisting of the following 4 data sources:

1.  Landsat  MSS  data  (4  data  channels) 

2.  Elevation  data  (in  10  m  contour  intervals,  1  data  channel) 

3.  Slope  data  (0-90  degrees  in  1  degree  increments,  1  data  channel). 

4.  Aspect  data  (1-180  degrees  in  1  degree  increments,  1  data  channel) 

Each channel comprised an image of 135 rows and 131 columns, and all channels were co-registered. The
area used for classification is a mountainous area in Colorado. It has 10 ground-cover classes which are listed
in Table 1. One class is water; the others are forest types. It is very difficult to distinguish between the
forest types using the Landsat MSS data alone since the forest classes show very similar spectral response
[1]. Reference data were compiled for the area by comparing a cartographic map to a color composite of

Table 1: Training and Test Samples for Information Classes in the Experiments on the Colorado Data Set

Class #   Information Class                   Training Size   Test Size
1         Water                               301             302
2         Colorado Blue Spruce                56              56
3         Mountane/Subalpine Meadow           43              44
4         Aspen                               70              70
5         Ponderosa Pine 1                    157             157
6         Ponderosa Pine/Douglas Fir          122             122
7         Engelmann Spruce                    147             147
8         Douglas Fir/White Fir               38              38
9         Douglas Fir/Ponderosa Pine/Aspen    25              25
10        Douglas Fir/White Fir/Aspen         49              50
          Total                               1008            1011

the  Landsat  data  and  also  to  a  line  printer  output  of  each  Landsat  channel.  By  this  method  2019  reference 
points  (11.4%  of  the  area)  were  selected  comprising  two  or  more  homogeneous  fields  in  the  imagery  for  each 
class.  It  has  been  shown  [2], [3]  that  neural  networks  are  sensitive  to  having  representative  training  samples. 
In  order  to  see  how  well  the  PCNNs  compared  to  a  backpropagation  neural  network  with  a  representative 
training  sample,  the  training  samples  were  selected  uniformly  spaced  apart  in  the  experiments.  Around 
50%  of  the  samples  were  used  for  training  and  the  rest  to  test  the  neural  networks  (see  Table  1).  Two 
versions of the PCNN were applied in classification of the Colorado data, i.e., PCNN with equal weights
and optimized weights. (The optimal approach reported here was the inverse method but the suboptimal
methods gave similar results.) The PCNN algorithms were implemented using one-layer conjugate-gradient
delta rule neural networks (CGLC) [2],[14] as their SNNs. The conjugate-gradient versions of the feedforward
neural networks are computationally more efficient than conventional gradient descent neural networks. The
original input data were Gray-coded because that representation has previously given the best results for this
particular  data  set  [2].  Using  the  Gray-code  and  8  bits  for  each  input  stage  expanded  the  dimensionality 
of  input  data  to  56  dimensions.  Therefore,  each  SNN  had  57  inputs  (one  extra  input  for  the  bias),  and  10 
outputs.  In  these  experiments  the  Gray-code  of  the  Gray-code  was  the  non-linear  transformation  selected. 
This is the same non-linear transformation used in [3]. Each SNN was trained for 200 iterations. To
provide a comparison with the results of the PCNN, the single-stage conjugate-gradient backpropagation (CGBP)
algorithm with two layers [14] was trained on the same data with a variable number of hidden neurons.
The  CGBP  neural  networks  had  57  inputs,  8,  16,  24  and  32  hidden  neurons  and  10  output  neurons.  Eleven 
experiments  were  run  for  the  PCNN  with  different  numbers  of  stages.  The  highest  number  of  SNNs  used 
in  each  PCNN  was  fifteen.  All  the  neural  networks  used  the  sigmoid  activation  function.  The  experiments 
were  run  on  a  SUN  SPARCstation  10/41. 
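
The input encoding described above can be sketched as follows. This is a minimal illustration, assuming the usual reflected binary Gray code and that each of the 7 data channels (4 Landsat MSS, elevation, slope, aspect) is an integer in the range 0 to 255; the function names, bit ordering and pixel values are hypothetical and are not taken from the paper.

```python
def gray_code(value: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return value ^ (value >> 1)

def encode_pixel(channels, bits=8):
    """Gray-code each channel value and unpack it into `bits` binary features,
    most significant bit first."""
    features = []
    for v in channels:
        g = gray_code(v)
        features.extend((g >> b) & 1 for b in reversed(range(bits)))
    return features

pixel = [12, 200, 37, 5, 128, 45, 90]  # hypothetical values for the 7 channels
features = encode_pixel(pixel)
print(len(features))  # 7 channels x 8 bits

# One reading of "the Gray-code of the Gray-code": apply the code a second time.
second_stage = [gray_code(gray_code(v)) for v in pixel]
```

Seven channels at 8 bits each give the 56 binary inputs mentioned in the text; adding the bias input gives 57.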

The  average  results  of  the  experiments  with  the  PCNN  are  shown  in  Figure  3  for  the  two  weight 
selection  schemes  and  the  standard  deviation  of  the  training  accuracy  for  the  PCNNs  is  shown  in  Figure  4. 
The  results  with  the  CGBP  (for  different  number  of  hidden  neurons)  are  shown  in  Figure  5  as  a  function  of  the 
number  of  training  iterations.  From  these  figures  it  is  clear  that  the  PCNN  methods  outperformed  the  single 
stage  CGBP  in  terms  of  classification  accuracy  of  test  data.  Also,  the  difference  between  the  equal  weight 
selection  and  the  optimal  weighting  method  became  very  clear  in  the  experiments.  The  optimal  approach 
clearly outperformed the equal weighting approach in terms of training accuracy. In fact, for training data,
the  optimal  weighting  approach  did  show  monotonically  increasing  overall  accuracy  as  a  function  of  the 
number  of  stages.  This  result  was  expected  since  the  weights  in  the  PCNN  were  optimized  based  on  the 
training  data.  On  the  other  hand,  the  PCNN  methods  showed  very  similar  test  accuracies  after  15  stages. 
On  the  average,  the  optimal  approach  achieved  80.77%  overall  accuracy  for  test  data  as  compared  to  80.74% 
for  the  equal  weighting  approach.  In  comparison,  the  CGBP  method  achieved  the  maximum  accuracy  of 
77%  for  test  data.  It  is  also  important  to  note  that  the  test  results  with  both  PCNNs  are  better  than  the 
best statistical result achieved in [4].

As  Figure  4  displays,  the  standard  deviation  of  the  classification  went  down  as  a  function  of  the 
number  of  stages  used.  Overall,  the  PCNN  results  in  these  experiments  were  very  satisfying. 

6  CONCLUSIONS 

In this paper optimized weights were computed for the stage networks in the PCNN architecture. The results
obtained showed the PCNN architecture to be a desirable alternative to conjugate-gradient backpropagation
for  multisource  classification  when  representative  training  data  are  available.  The  results  for  the  PCNN 
outperformed  all  other  methods  (applied  now  and  previously  on  the  data  set  used)  in  terms  of  classification 
accuracy  of  test  data.  The  results  using  the  optimized  weights  were  very  promising  and  it  is  important  to 
note  that  the  new  optimized  weighting  approach  can  also  be  used  for  the  networks  proposed  in  [11]  and  [12]. 
Although binary input data were used in the experiments, the PCNN with optimized weights works both
for  analog  and  binary  input  data. 

At this point, the PCNNs need to be tested extensively. Different non-linear transformations and
the  various  weight-selection  schemes  proposed  here  need  to  be  explored  more  thoroughly.  Also,  different  types 
of  PCNN  architectures  are  being  investigated.  These  architectures  include  PCNNs  with  different  non-linear 
transforms  for  each  stage  and  different  number  of  iterations  for  the  stages.  The  most  important  remaining 
problem  in  the  research  concerning  the  PCNN  architecture  is  the  selection  of  the  non-linear  transformations. 
In this paper we did not concentrate on that problem but used a somewhat ad hoc method, i.e., the
Gray-code of the Gray-code. Using an optimal non-linear transformation could be critical to the performance of
the  PCNN. 

Figure 3: Average results for the PCNN with equal and optimal weights as a function of the number of
SNNs. The upper curves represent training results and the lower curves test results.

Figure 4: Standard deviation for the training results of the PCNN methods.

Figure 5: Experimental results for the CGBP with a variable number of hidden neurons. The upper curves
represent training results and the lower curves test results.

In this paper we propose an efficient and flexible method of finding the nearest neighbor using a
CMAC. A subset of design patterns is selected as probable candidates amongst which the nearest
neighbor is searched for. This reduces the number of distance computations compared to the traditional
approach. Unlike many other efficient techniques, this system can be trained on additional design patterns
at a later time without affecting the previous learning. Experimental results are presented to demonstrate
the efficiency of the proposed approach.


1  Introduction 

Since its formal introduction as a classification tool by Cover and Hart in 1967 [1], the Nearest Neighbor (NN)
rule has been applied in a large number of areas and is still an interesting topic of research [2]. In its simple
form, it deals with the problem of associating a new pattern with the label of one of the design
patterns already known. As the name implies, the label of the nearest neighbor, measured
using a suitable distance metric, is the one chosen to be associated with the new pattern. An extension
of this rule is the k-NN rule, which considers the classes of the k nearest neighbors before such an association
is made. The NN rule is not only intuitive; its error rate is also bounded above by at most twice the
Bayes error [1], which makes it attractive for use with patterns whose distributions are not known. For
an excellent and comprehensive survey of NN techniques, see Dasarathy [3].

In spite of serving as an important nonparametric method for pattern classification, the NN approach has
a major problem: computational complexity. The search for the NN involves a large number of distance
computations whose complexity increases with dimensionality. As a result, a number of techniques have been
proposed to reduce this computational complexity, most of which can be classified under one of the following
categories: “condensed” approach, “hierarchical” approach, “pattern preprocessing” approach, “feature space
partitioning” approach, and “neural network” approach.

Among the earliest of the condensed approaches is the condensed nearest neighbor (CNN) rule [4], which is
a method to derive a condensed set of prototype patterns that will give the same result as the original set.
While this approach advocates a subset-growing methodology, a similar method called the reduced nearest
neighbor (RNN) rule [5] advocates deriving the minimal set by starting off with the complete original set
and iteratively deleting unnecessary elements. In both approaches, however, a minimal subset is
not guaranteed.

Hierarchical approaches, usually referred to as “branch and bound” techniques, construct
a tree to cluster the data. The search for the nearest neighbor is reduced because only a few
branches are examined. The algorithm, first suggested in 1975 [6], has since undergone a number of
improvements [2].

Preprocessing the data has often been suggested as a way of reducing the computational complexity of the actual
NN  search.  Sethi  [7]  suggests  an  approach  where  all  the  design  patterns  are  ordered  with  respect  to  their 
distances  from  three  reference  points.  Only  the  patterns  within  a  small  neighborhood  of  the  new  pattern  are 
considered  while  searching  for  the  NN. 

Among the earlier and simpler methods of feature space partitioning is the “cube” algorithm of Yunck [8]. Here
the nearest neighbor of a new pattern is searched for within a hypercube that surrounds the pattern. Use of
a k-d tree [9] is a more recent method used to reduce the search complexity. Partitioning the feature space
perpendicular to each axis in such a manner as to maintain an equal (or near equal) number of patterns on
each side of the partition is the central idea of this scheme. The partitioning is done recursively until a small
number of samples remain within each partition. An incremental search starting from the root node is then
used to search for an arbitrary set of m nearest neighbors.

Recently, the k-NN rule has been implemented using artificial neural networks (ANN) [10] built on four
blocks: matching network, k-maximum network, counting network and 1-maximum network. In the matching
network the training patterns are stored in the interconnections. The matching scores between a new
pattern and the stored patterns computed by this block are fed into the subsequent layers to compute the
k-nearest neighbors. When the number of training patterns and the value of k are known a priori, the parallelism
offered by the network can be exploited by hardware. However, if these parameters are not known, the
network has to be redesigned each time, and simulating it in software offers no advantage over the traditional
k-NN classifier.

Figure 1: Conceptual architecture of the CMAC

In this paper, we present a method to reduce the complexity of the search for the nearest neighbors by associating
with a new pattern only a small set of design patterns amongst which to look for the nearest
neighbor. The selection of this small set is facilitated by a model called the CMAC (Cerebellar Model Articulation
Controller / Cerebellar Model Arithmetic Computer), which was introduced as a controller for a
robotic manipulator [11]. Later the CMAC found applications in shape recognition [12] and neural network
domains [13]. We adapt the model for nearest neighbor determination. The use of the CMAC in computing
the nearest neighbor offers two major advantages: efficiency and flexibility. Efficiency is offered in terms
of the CMAC providing a small subset of patterns for distance computations, while flexibility is offered in
terms of the CMAC’s ability to update its database of design patterns without having to go through the
“learning” process again. While a number of methods described in the literature to compute the nearest
neighbor are efficient, few offer the flexibility of not having to retrain the system when new design patterns
have to be added.

2  The  CMAC 

2.1  Architecture 

The  CMAC  (Cerebellar  Model  Articulation  Controller/  Cerebellar  Model  Arithmetic  Computer)  was  first 
introduced  in  1972  by  James  Albus.  See  [14]  for  a  good  introduction  to  the  model.  It  is  a  serious  attempt  to 
model the functionality of the human brain while maintaining a simplified but close structural relationship
to  it. 

The CMAC is a system that can be trained to learn even nonlinear transformations. Like most neural network
models [15], it operates in two phases: a training phase during which a set of patterns is presented to the
system, and a testing phase when it responds to the inputs presented. Again, as in a neural network, the
input to output transformation is learnt by the system and need not be known a priori. The effectiveness of
the  system  is  gauged  by  its  ability  to  provide  correct  or  near  correct  responses  to  slightly  corrupted  inputs. 

In  applications  that  use  the  CMAC,  three  basic  properties  are  exploited.  First,  when  provided  with  an  input 
pattern similar to an exemplar pattern that was learnt in the training phase, it produces an output that is
similar to the response that should have been produced by the exemplar pattern. Second, to produce such an
output, it sums up the contents of distributed memory locations. This distribution of information enhances
the robustness of the system. Third, the CMAC uses overlapping low resolution blocks in the input space
to aid in its input/output mapping.

These concepts are illustrated in Figure 1. The input space is divided into a number of “blocks”. In the case
of the three-dimensional input space shown in the figure, each axis is divided at regular intervals, with
each interval labelled with a unique address. Therefore, each block is uniquely addressable. For example, if
along each axis the numbering begins from 0, the address of the block indicated by the thick arrow is 012. A
set of non-overlapping blocks that partition the input space is called a level. Another set of non-overlapping
blocks with a small offset along each axis can be generated to encompass the same input space. That would
constitute another level.

If the address of a block is determined by the limits within which it falls along each feature axis, the range
of block addresses can be enormous. This address range, often termed conceptual memory, can grow
even bigger with increasing dimensions and increasing resolution of the blocks. In reality, an address in the
conceptual memory, called a virtual address, is mapped to a physical address in physical memory by hashing.
The address of the block into which a pattern falls is the virtual address of the pattern. Each level is
assigned its own memory partition and generates a virtual address for the pattern encountered. The output
corresponding to the pattern is distributed over hashed physical addresses over all the memory partitions.
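
The addressing scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the levels are offset by equal fractions of the block size, Python's built-in `hash` stands in for the unspecified hash function, and all function names and parameter values are hypothetical.

```python
def virtual_address(pattern, block_size, offset=0.0):
    """Block index along each axis for one level whose grid is shifted by `offset`."""
    return tuple(int((x + offset) // block_size) for x in pattern)

def physical_address(vaddr, level, partition_size):
    """Hash a virtual address into a slot inside the level's own memory partition."""
    return level * partition_size + hash(vaddr) % partition_size

def cmac_addresses(pattern, num_levels, block_size, partition_size):
    """One physical address per level for the given pattern."""
    addrs = []
    for level in range(num_levels):
        offset = level * block_size / num_levels  # assumed offset scheme
        vaddr = virtual_address(pattern, block_size, offset)
        addrs.append(physical_address(vaddr, level, partition_size))
    return addrs

# A pattern falling in the block with address 012 on the unshifted level.
print(virtual_address((0.5, 1.5, 2.5), block_size=1.0))
addrs = cmac_addresses((0.5, 1.5, 2.5), num_levels=4, block_size=1.0, partition_size=64)
print(addrs)
```

Because each level writes only into its own partition, the physical memory needed is the number of levels times the partition size, regardless of how large the conceptual address range is.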

A CMAC works as follows. When a training pattern is presented to the system, a virtual address for the
pattern  is  determined  for  each  level.  This  is  merely  the  address  of  the  block  into  which  the  pattern  falls.  A 
hash  algorithm  would  then  give  a  physical  address  corresponding  to  each  level.  The  current  contents  of  the 
physical addresses are summed to give an output. The difference between the actual (desired) and current outputs
is then used as a correction term, a proportion of which (depending upon the learning rate) is added to or
subtracted from the physical addresses in equal shares. Thus if a training pattern is presented to the system
repeatedly, its output gets closer and closer to the actual output, until training is stopped or no change in
memory content takes place. Recent studies have proved the convergence of the algorithm [16].
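
The training procedure just described can be sketched in a few lines. This is a minimal illustration for a scalar output, assuming equally spaced level offsets and Python's built-in `hash` as the hash function; the class name, default parameters and the training pattern are hypothetical.

```python
class CMAC:
    """Minimal scalar-output CMAC with one memory partition per level."""

    def __init__(self, num_levels=4, block_size=1.0, partition_size=64, learning_rate=0.5):
        self.num_levels = num_levels
        self.block_size = block_size
        self.partition_size = partition_size
        self.learning_rate = learning_rate
        self.memory = [0.0] * (num_levels * partition_size)

    def _addresses(self, pattern):
        addrs = []
        for level in range(self.num_levels):
            offset = level * self.block_size / self.num_levels  # assumed offset scheme
            vaddr = tuple(int((x + offset) // self.block_size) for x in pattern)
            addrs.append(level * self.partition_size + hash(vaddr) % self.partition_size)
        return addrs

    def output(self, pattern):
        # Sum the contents of the physical addresses, one per level.
        return sum(self.memory[a] for a in self._addresses(pattern))

    def train(self, pattern, target):
        addrs = self._addresses(pattern)
        # A proportion of the error is spread equally over the addressed cells.
        correction = self.learning_rate * (target - self.output(pattern))
        for a in addrs:
            self.memory[a] += correction / len(addrs)

cmac = CMAC()
for _ in range(20):
    cmac.train((0.5, 1.5, 2.5), target=3.0)
print(cmac.output((0.5, 1.5, 2.5)))
```

With a learning rate of 0.5 the remaining error halves on every presentation of the same pattern, so the output approaches the target geometrically.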

In  the  testing  phase,  if  a  new  pattern  is  slightly  shifted  with  respect  to  one  of  the  training  patterns  the 
output  error  will  be  determined  by  the  contents  of  the  virtual  addresses  that  are  not  common  between  the 
new  pattern  and  the  previously  trained  pattern.  A  consequence  of  this  scheme  is  that  slightly  corrupted 
inputs  will  give  slightly  corrupted  outputs. 

2.2  Modifications  to  the  CMAC 

We modified the CMAC to be used as a NN classifier. Briefly, the design patterns to be learnt by the CMAC
are identified by an index and stored. When a new pattern is presented, a subset of the learned design
patterns is retrieved, amongst which the nearest neighbor is computed. The number of patterns retrieved is
a function of four parameters of the CMAC: the memory partition size, the block size, the number of levels,
and the offset of the levels with respect to each other.

The CMAC as a NN classifier operates in two phases: a learning phase and a classification phase. In
the learning phase, the design patterns indexed by their sequence number are input to the system.
The  sequence  numbers  used  are  just  to  identify  the  design  patterns  and  their  ordering  has  no  effect  on 
the  performance  of  the  system.  For  each  design  pattern,  the  system  computes  a  virtual  address,  which  is 
mapped  to  a  physical  address.  The  physical  addresses  are  organized  as  a  sequence  of  bits.  A  “1”  in  the 
bit  indicates  that  the  design  pattern  has  been  hashed  into  this  physical  address,  while  a  “0”  indicates  an 
absence  of  such  a  hash.  At  the  hashed  physical  address,  the  bit  that  corresponds  to  the  index  of  the  design 
pattern,  is  set.  The  virtual  to  physical  address  hashing  and  bit  setting  is  done  for  each  level  of  the  CMAC. 
In  addition,  the  design  patterns  are  stored  in  an  array  in  the  sequence  they  are  encountered. 

In the classification phase the virtual address of the new pattern is first computed for each level and
then mapped to a physical address. The indices of all the bits that are set to “1” at that physical address are
retrieved since they correspond to patterns that are in the neighborhood (defined by the block size) of the
current pattern. This retrieval is done for each level and a count of the number of retrievals for each index
is kept. The indices retrieved the greatest number of times are probable candidates for the nearest neighbor. If
more than one index attains the maximum count, all of them are selected and the distance between the new pattern
and each of the retrieved patterns is computed. The pattern at the smallest of these distances is taken as the nearest neighbor.
Note  that  it  is  not  necessary  for  the  system  to  retrieve  the  same  number  of  samples  for  each  new  pattern. 
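
The two phases can be sketched as follows. This is an illustrative reading of the description above, not the authors' C implementation: a Python set of pattern indices stands in for the bit string at each physical address, the level offsets are assumed to be equally spaced, Python's built-in `hash` is used as the hash function, and all names and parameter values are hypothetical.

```python
import math
from collections import Counter, defaultdict

class CMACNearestNeighbor:
    def __init__(self, num_levels=4, block_size=1.0, partition_size=64):
        self.num_levels = num_levels
        self.block_size = block_size
        self.partition_size = partition_size
        self.patterns = []               # design patterns in arrival order
        self.labels = []
        self.memory = defaultdict(set)   # physical address -> indices "set to 1"

    def _addresses(self, pattern):
        for level in range(self.num_levels):
            offset = level * self.block_size / self.num_levels  # assumed offset scheme
            vaddr = tuple(int((x + offset) // self.block_size) for x in pattern)
            yield level * self.partition_size + hash(vaddr) % self.partition_size

    def learn(self, pattern, label):
        """Learning phase: one pass, no iteration; can be called at any later time."""
        index = len(self.patterns)
        self.patterns.append(pattern)
        self.labels.append(label)
        for addr in self._addresses(pattern):
            self.memory[addr].add(index)

    def classify(self, pattern):
        """Classification phase: returns (label, number of distance computations)."""
        votes = Counter()
        for addr in self._addresses(pattern):
            votes.update(self.memory.get(addr, ()))
        if not votes:
            return None, 0
        best = max(votes.values())
        candidates = [i for i, v in votes.items() if v == best]
        nearest = min(candidates, key=lambda i: math.dist(pattern, self.patterns[i]))
        return self.labels[nearest], len(candidates)

clf = CMACNearestNeighbor()
clf.learn((0.1, 0.1), "a")
clf.learn((5.2, 5.1), "b")
clf.learn((0.3, 0.2), "a")
label, n_distances = clf.classify((0.2, 0.2))
print(label, n_distances)
```

Only the candidates with the highest retrieval count incur a distance computation, which is where the saving over an exhaustive search comes from; the price is that the true nearest neighbor may occasionally be missing from the candidate set.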

A number of properties of the CMAC make it attractive for use as a NN classifier. The training is “one shot”
and does not involve any intense computation. The ordering of the design samples is not important and
they can be input to the system even incrementally without affecting the previously learnt patterns. The
retrieval of a subset of design patterns reduces the number of distance computations, making the CMAC an
efficient  NN  classifier.  The  size  of  the  retrieved  subset  of  patterns  can  be  controlled  by  the  user  to  a  limited 
extent  by  specifying  the  various  CMAC  parameters. 

3  Experimental  Details 

The CMAC NN classifier was implemented in C on a SPARCstation 2 running the 4.3BSD UNIX operating system.
The system performance was graded on two measures, efficiency and recognition rate,
which are described in the next section. Three data sets were used in the experiments, the details of which are
described in Section 3.2. In Section 3.4 we report the effects of some CMAC parameters on the performance.

3.1 Performance measures

As indicated in the earlier sections, the CMAC is an efficient NN classifier. It is therefore natural to choose
to measure the computational gain obtained on the data. The measure chosen should be data independent
as much as possible. The efficiency, $\eta$, is defined for this purpose as $\eta = 1 - N_c / N_n$, where $N_c$ is the number of
distance computations using the CMAC, and $N_n$ is the number of distance computations using the traditional NN
classifier. Though data dependent, we also report the recognition rate, $\gamma$, which is defined as $\gamma = N_f / N_t$, where
$N_f$ is the number of correct nearest neighbors determined and $N_t$ is the total number of test patterns.
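
As a small worked example of the two measures, the sketch below assumes the definitions read as an efficiency of one minus the ratio of distance computations and a recognition rate equal to the fraction of correct nearest neighbors. The counts used are hypothetical and chosen only to show the arithmetic.

```python
def efficiency(n_cmac: float, n_nn: float) -> float:
    """1 - Nc / Nn: fraction of distance computations saved by the CMAC."""
    return 1.0 - n_cmac / n_nn

def recognition_rate(n_correct: int, n_total: int) -> float:
    """Nf / Nt: fraction of test patterns whose nearest neighbor was found."""
    return n_correct / n_total

# Hypothetical counts: 48 distances per query instead of 528, 40 of 50 queries correct.
eta = efficiency(48, 528)
gamma = recognition_rate(40, 50)
print(f"efficiency = {eta:.3f}, recognition rate = {gamma:.2%}")
```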


3.2  Data 

The “vowel” data set, used for speech recognition, has often been used as a benchmark for ANN systems to
compare the performance of different networks. The data, which was collected by Deterding [17], consists of
11 steady state vowels uttered by 15 speakers, 8 males and 7 females, with each speaker repeating a vowel 6
times. Of the 90 sets, the first 48 sets were used for training and the other 42 sets were used for testing. In
effect, there are 528 training samples and 462 testing samples. The actual data is a 10-dimensional vector
whose  components  are  based  on  the  log  area  ratios.  The  published  literature  [18]  indicates  that  classification 
based  on  this  data  has  been  difficult  and  the  “best”  classification  rate  obtained  so  far  is  56%  using  the  NN 
approach. 

The second set of data, called the “sonar” data, was first used by Gorman et al. [19]. The data was obtained
from sonar returns bouncing off two types of materials, a metal cylinder and a roughly cylindrical rock, at various
angles and under various conditions. Each pattern is a 60-dimensional vector whose components are in the
range  [0,1].  Each  component  represents  the  integrated  energy  over  a  particular  frequency  band.  There  are 
104  training  patterns  and  104  testing  patterns. 

The  third  set,  called  the  “fingerprint”  data,  consists  of  2000  training  patterns  and  2000  test  patterns  each 
belonging  to  one  of  5  classes.  Each  pattern  has  112  components.  The  data  was  obtained  from  images 
of  fingerprint  impressions  following  feature  extraction.  Direction  vectors  served  as  features.  The  initial 
correlated 1680-dimensional vector was reduced to 112 dimensions by performing the Karhunen-Loeve transform.

3.3  Classification  on  large  data  set 

Though  a  recognition  rate  of  100%  can  be  obtained  by  adjusting  the  parameters  of  the  CMAC,  we  also  want 
to  achieve  computational  gains.  Therefore,  subjecting  the  three  sets  of  data  to  the  CMAC  NN  classifier,  we 
tabulate  what  we  consider  our  best  results. 


Data set           Efficiency, η   Recognition rate, γ (%)   1-NN recognition rate (%)
fingerprint data   0.99            74.95                     82.85
sonar data         0.98            67.30                     91.34
vowel data         0.91            50.21                     56.27

With all the data sets, the CMAC NN classifier is able to perform at over 90% efficiency with slight degradation
in recognition rate. Even though the recognition rate of the sonar data using the traditional NN
classifier  seems  much  higher,  that  same  rate  can  be  obtained  at  the  cost  of  some  efficiency.  As  indicated 
above,  these  values  do  not  reflect  the  best  recognition  rates  but  rather  the  compromise  between  efficiency 
and  recognition  rates. 

3.4  Effect  of  number  of  levels  and  block  size 

Figure 2 is a plot of the average number of retrievals versus the number of levels, while Figure 3 is a plot
of recognition rate versus the number of levels in the CMAC. These plots are for the “vowel” data. As the
number of levels increases, there is initially a drastic reduction in the number of samples retrieved, after which
it becomes more or less uniform. Also, depending upon the pattern distribution, in general, the smaller the
initial set retrieved, the lower the probability of finding the nearest neighbor in the retrieved set. This causes
the reduction in recognition rate with increased number of levels seen in Figure 3. The isolated peaks in
the figure, we believe, are caused by the distribution of the data; the main point to note here is the general
decrease in performance.

Figures 4 and 5 plot the average number of retrievals and the recognition rate against
block size. The block size is indicated as a percentage of the maximum range of the input vector component.
As the block size increases, more and more patterns tend to be retrieved, as one would expect, and therefore both the
average number of initial retrievals and the recognition rate increase. Again, peaks in the performance
curve are caused by the distribution of the design patterns.

While  it  is  possible  to  obtain  a  good  recognition  rate  by  increasing  the  number  of  retrievals,  for  applications 
where  the  exact  nearest  neighbor  is  not  so  crucial,  a  reasonably  good  performance  can  be  achieved  by 
retrieving  only  a  small  set  of  design  patterns.  Thus  one  can  establish  a  tradeoff  between  accuracy  and 
computational  complexity. 

Figure 2: Average number of retrievals as a function of number of levels

Figure 3: Recognition rate as a function of number of levels

Figure 4: Average number of retrievals as a function of block size

Figure 5: Recognition rate as a function of block size


3.5  Computational  savings 

The fingerprint data serves to illustrate the performance of the CMAC NN classifier on a data set that is
large and has a high dimensionality. The table below compares the performance of the CMAC NN classifier
with that of the traditional NN classifier. With little degradation, the average number of retrievals for distance
computation is reduced drastically. As indicated earlier, the CMAC NN classifier is best suited to applications
where computational speed is more important than a slight loss in accuracy.


Classifier        Performance    Avg # of distance computations    Time (in seconds)
Brute force NN    82.85%         2000                              1639
CMAC NN           74.95%         6.686                             423
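
The ratios implied by the table can be worked out directly. The short sketch below only restates the tabulated figures; the dictionary keys and variable names are illustrative choices, not names from the paper.

```python
# Figures copied from the table above (fingerprint data).
brute_force = {"rate": 82.85, "distance_computations": 2000, "seconds": 1639}
cmac_nn = {"rate": 74.95, "distance_computations": 6.686, "seconds": 423}

# Factor by which the average number of distance computations drops.
computation_reduction = (
    brute_force["distance_computations"] / cmac_nn["distance_computations"]
)

# Factor by which the reported running time drops.
time_speedup = brute_force["seconds"] / cmac_nn["seconds"]

# Recognition rate given up, in percentage points.
rate_loss = brute_force["rate"] - cmac_nn["rate"]

print(f"distance computations reduced by a factor of {computation_reduction:.1f}")
print(f"running time reduced by a factor of {time_speedup:.2f}")
print(f"recognition rate lower by {rate_loss:.2f} percentage points")
```

The reduction in distance computations is far larger than the reduction in running time, which suggests that part of the reported time is spent outside the distance computations themselves.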

3.6  Interpretation  of  results 


The results of our experiments with real-world data point to a number of useful features. Learning is a one-shot
process that requires no preprocessing or ordering of the data, which eliminates the computation for
“preparing” the data that is usually associated with fast retrieval methods. This method can work with a
limited memory; the only effect of using limited memory would be that the retrieved sample set increases.
The number of initial index retrievals can be controlled to a certain extent by specifying the parameters of
the CMAC. For example, if a large set is needed, the block size can be increased.

There is a compromise between the initial number of pattern indices retrieved and the computational
complexity. For example, if a larger number of indices is retrieved, a larger search for the nearest neighbor
is required among the retrieved patterns, while if a small set is retrieved, there is a chance that the actual
nearest neighbor is not in that small set. In applications where the exact nearest neighbor is not crucial,
this method can help reduce the computation by a large factor.
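
The tradeoff can be illustrated with a deliberately simplified stand-in for the CMAC addressing. The sketch below is an assumption-laden illustration, not the classifier described in this paper: it uses a single quantization grid (no multiple levels), hypothetical function names `build_buckets` and `bucket_nn`, and a brute-force fallback when a cell is empty. It shows only the general idea of retrieving a small candidate set and searching for the nearest neighbor within it.

```python
from collections import defaultdict

import numpy as np


def build_buckets(patterns, block_size):
    """Index stored patterns by the grid cell (block) that contains them."""
    buckets = defaultdict(list)
    for idx, p in enumerate(patterns):
        key = tuple(np.floor(np.asarray(p) / block_size).astype(int))
        buckets[key].append(idx)
    return buckets


def bucket_nn(query, patterns, buckets, block_size):
    """Return (index of nearest retrieved pattern, number of patterns retrieved)."""
    query = np.asarray(query, dtype=float)
    key = tuple(np.floor(query / block_size).astype(int))
    # Fall back to a full search if no stored pattern shares the query's block.
    candidates = buckets.get(key) or list(range(len(patterns)))
    dists = [np.linalg.norm(np.asarray(patterns[i]) - query) for i in candidates]
    return candidates[int(np.argmin(dists))], len(candidates)


patterns = np.array([[0.1, 0.1], [0.15, 0.2], [0.9, 0.9], [0.55, 0.5]])
buckets = build_buckets(patterns, block_size=0.5)
nearest, retrieved = bucket_nn([0.12, 0.12], patterns, buckets, block_size=0.5)
```

A larger `block_size` puts more patterns in each cell, so more distances are computed but the true nearest neighbor is more likely to be among the candidates; a smaller one does the opposite.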


4  Conclusions  and  future  research 

We have presented a new method for finding the nearest neighbor of a pattern, using the CMAC model.
The nearest neighbor is searched for among a small set of retrieved indices of initially stored patterns.
By controlling the parameters of the CMAC, we can control the size of this set. While a large retrieved set
will require a lot of computation to search for the nearest neighbor, a small set will tend not to contain the
actual nearest neighbor. Depending upon the application, a suitable compromise can be reached.

At present, we are looking into two possible extensions of this approach. One is to be able to specify
hyperspherical blocks, so that, while using the Euclidean distance metric, unnecessary sample patterns
are not retrieved. The second extension is to order the retrieved small set to obtain the k-NNs of the new
pattern. This is a rather simple extension, though at the moment we have not implemented it.

Acknowledgments: We acknowledge the use of the NIST database for the “fingerprint” data, and the
University of California, Irvine’s repository of machine learning databases for the “sonar” and “vowel” data.
Abstract

When two-dimensional images are used as input to a neural network, noise from the input device and
small deformations in the end parts, which occur while separating each pattern and normalizing its size, cause
the input image to be shifted from the originally learned image. This is a major cause of
misrecognition. In existing multi-layer perceptrons using standard EBP, it is difficult to solve shift invariance
problems because pattern pixel values are presented directly to the neural network input nodes. Second order neural
network inputs consist of geometrically related nonlinear combinations of two pixels, and can be used for shift
invariant pattern recognition. But the number of second order neural network input nodes increases in proportion
to N^2, where N is the dimension of the input patterns, even if we only consider shift invariance. Such a large number
of input nodes leads to slower learning and recognition.

In this paper, we propose a method for reducing the number of shift invariant second order neural network
input nodes using combinations of input pattern pixels and PCA (Principal Component Analysis). Using the
proposed method, we are able to implement a shift invariant second order neural network with 2/5*N nodes. Due to
the reduced number of input nodes, a 50% reduction in the learning and recognition time was obtained.


I. Introduction

Multi-layer perceptrons using the EBP (Error Back Propagation) learning rule have recently attracted a great deal of
interest in the field of pattern recognition, where solutions using existing algorithmic methods are difficult, because of
their simplicity and superior problem solving capabilities. But the perceptron has a low recognition rate for
patterns which have been geometrically transformed (rotation, scaling, shifting). This shows that the multi-layer
perceptron model has a weak point in recognizing geometrically transformed patterns, one of the basic problems
of pattern recognition [2]. Geometric invariance is important in pattern recognition because the target location and
orientation are usually unknown. Among all the variations, shift variation generally occurs in two-dimensional
pattern recognition. Shift variation refers to the shifting of the entire extracted pattern as a result of a small amount
of noise at the end points of the pattern that occurs during the process of extracting individual patterns from the
entire scene [1].

To overcome such geometric variations, research on using higher order neural networks, which can
learn variations by themselves, has been actively conducted. Recent research has shown that because invariances can be
built into the architecture of higher order neural networks, they can be effectively used for invariant pattern
recognition [2,3]. But straightforward use of higher order neural networks is limited in actual implementations
because of the combinatorial explosion of the number of input layer nodes in proportion to the input pattern
dimension. If the input pattern to be learned is of dimension N, the number of input layer nodes increases as
O(N^2), even if we only consider second order neural networks. This is the major obstacle to actual
implementation. Also, such a large number of input nodes leads to slower learning and recognition [5,7].

There have been many approaches which maintain the geometric invariance properties of higher order
networks while reducing the number of input nodes. For example, sigma-pi networks use a hidden layer of higher
order nodes [10]. This strategy retains the higher order network properties, but its learning speed is slower.
Another approach is to use a priori information to remove the terms which are irrelevant to the problem in a single
layer of higher order nodes [5,7]. However, since it is often difficult to find the properties of the input pattern space a
priori, this strategy has limited applications.

Recently, the problems of the above methods have been overcome by representing pixel
combinations that stand in the same relation, in the context of the desired invariance, by a single representative node [1,4,6].
Using such methods allows an O(N) implementation of high order neural networks with N-dimensional input
patterns. For example, shift invariant second order neural networks have 2*N input nodes for N-dimensional input
patterns. But even such methods still need a large number of input nodes, necessitating longer learning and
recognition time.

In this paper, as part of ongoing research to reduce the number of high order neural network input
nodes, we propose a method for reducing the number of shift invariant second order neural network input nodes to
close to the input pattern dimension N using PCA (Principal Component Analysis) and pattern pixel combinations.
Using the proposed method, we are able to implement a shift invariant second order neural network with only
2/5*N nodes. Experiments using the implemented neural network on a shifted Korean Munjo character set resulted
in about a 95% recognition rate. Due to the reduced number of input nodes, a reduction in the learning and recognition
time was also obtained.

The rest of the paper is structured as follows. In section II, we show an O(N) shift invariant second order
neural network implementation using pixel combinations. Section III details the process of implementing a
further reduced shift invariant second order neural network using principal component analysis on the pixel
combinations obtained in section II. Section IV presents the experimental results that show the proposed reduced
second order neural network is superior to the existing non-reduced second order neural network. The paper ends
with a conclusion in section V.


II. Shift Invariant Second Order Neural Network using Pixel Combinations


Second order neural networks have the lowest order among higher order neural networks and have shift
and size invariant properties. These properties are due to the fact that second order neural networks perceive the
relationship between combinations of two pixels.

The operation equation for a second order neural network is shown in equation (1). To obtain shift
invariance, we need only take the two-dimensional correlation terms of equation (1) into account. Equation (2)
shows the terms of equation (1) that affect the relative positions of the inputs; only these terms need be taken into
account for shift invariance [2,3].


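net_i = \sum_{j=1}^{D} w_{ijj} x_j x_j + \sum_{j=1}^{D-1} \sum_{k=j+1}^{D} w_{ijk} x_j x_k + \sum_{j=1}^{D} w_{ij} x_j + w_i    (1)

net_i = \sum_{j=1}^{D-1} \sum_{k=j+1}^{D} w_{ijk} x_j x_k    (2)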

For  shift  invariant  learning,  weights  in  shifted  positions  are  updated  simultaneously  according  to  equation 
(3).  So  equation  (2)  can  be  rewritten  as  equation  (4). 

(3)

net_i = \sum_{j=1}^{D-1} \sum_{k=j+1}^{D} w_{i(k-j)} x_j x_k    (4)

In shift invariant second order neural networks, since all weights in shifted positions have the same values,
all the two-pixel combinations are added beforehand and the cumulative value can be represented by a single input
node of the second order neural network. This feature, the summation of all the two-pixel combinations for a given
relative position, is called the SOP (Summation Of Products) or second order feature [1,7].

SOP_d = \sum_{j=1}^{D-d} x_j x_{j+d},  d = k - j    (5)

In obtaining second order features, the direction must be taken into account. There are four directions
which two-pixel combinations can have. So, the second order feature can be represented by a 2*N - (L+M)
dimensional vector, with each element representing the number of pixel combinations at the corresponding relative position. Fig 1.
shows the possible directions of two-pixel combinations and an example of the second order feature.

Fig 1. Directions of two pixel combinations and example of second order feature


As can be seen above, a shift invariant second order neural network can be implemented by a first order
multi-layer perceptron which has about 2*N input nodes when using the second order features.
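
The second order features can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an L x M image with N = L*M pixels, counts each unordered pixel pair once by restricting relative positions to a half-plane, and uses the hypothetical function name `sop_features`. Under these assumptions the number of relative positions is 2*N - (L+M), matching the dimension quoted above.

```python
import numpy as np


def sop_features(img):
    """Sum of products of pixel pairs for every relative position (dy, dx).

    Relative positions are restricted to dy > 0, or dy == 0 and dx > 0,
    so each unordered pair of pixels is counted exactly once.
    """
    img = np.asarray(img, dtype=float)
    L, M = img.shape
    feats = []
    for dy in range(L):
        dx_range = range(1, M) if dy == 0 else range(-(M - 1), M)
        for dx in dx_range:
            # a holds the first pixel of each pair, b the pixel displaced by (dy, dx).
            a = img[0:L - dy, max(0, -dx):M - max(0, dx)]
            b = img[dy:L, max(0, dx):M - max(0, -dx)]
            feats.append(float(np.sum(a * b)))
    return np.array(feats)


# The same small pattern placed at two different positions on a blank 4 x 5 image.
pattern = np.array([[1, 1], [0, 1]])
original = np.zeros((4, 5))
original[0:2, 0:2] = pattern
shifted = np.zeros((4, 5))
shifted[2:4, 3:5] = pattern

f_original = sop_features(original)
f_shifted = sop_features(shifted)
```

Both images give the same feature vector, because each feature depends only on the relative position of pixel pairs, which shifting does not change as long as the pattern stays inside the frame.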


III. Reduction of Second Order Neural Network using PCA (Principal Component
Analysis)

A shift invariant second order neural network using second order features that consider position and
directional correlations has O(N) input nodes. But the size of the neural network still increases twice as fast as the
dimension of the real pattern. To be realistically applied to pattern recognition systems, the learning and
recognition process must be achieved as quickly as possible. Thus it is preferable to reduce the number of input
nodes of second order neural networks as much as possible.

PCA (Principal Component Analysis) is a statistical tool which yields substantial data reduction by
representing each pattern in terms of a relatively small subset of orthonormal features (Principal Components)
extracted from the input set [8,9]. The Principal Components are eigenvectors of the covariance matrix formed
from the pattern set. Each eigenvalue is equal to the variance of the projections of the patterns onto the
corresponding principal component. The principal components can be obtained by using the diagonal terms from
the diagonalized covariance matrix. Then the values obtained from the PCA are equivalent to the variance of each
dimension of each pattern. So, the variances of each dimension can be used as Principal Components [8,9].

Fig. 2 shows the top 80% of the variance of each dimension from 990 character patterns from the Korean
Munjo character set. As can be seen from Fig. 2, dimensions of smaller variances differ very little for most of the
patterns. Therefore such dimensions only increase the learning and recognition time while contributing very little
to the overall recognition rate. By eliminating these dimensions from the second order features, improvements in
learning and recognition time can be obtained without adversely affecting the recognition rates.

Fig 2. Distribution of second order feature values for each dimension
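
The selection step described above, keeping only the feature dimensions with the largest variance across the pattern set, can be sketched as follows. This is a minimal illustration, assuming the features are arranged as a patterns-by-dimensions array; the function name `top_variance_indices` and the toy data are hypothetical.

```python
import numpy as np


def top_variance_indices(features, fraction):
    """Indices of the feature dimensions with the highest variance.

    features: array of shape (num_patterns, num_dimensions)
    fraction: share of dimensions to keep, e.g. 0.2 for the top 20%
    """
    features = np.asarray(features, dtype=float)
    variances = features.var(axis=0)
    keep = max(1, int(round(fraction * features.shape[1])))
    order = np.argsort(variances)[::-1]  # highest variance first
    return np.sort(order[:keep])


# Toy data: two patterns, five feature dimensions.
toy = np.array([[0, 0, 0, 0, 0],
                [1, 10, 0, 5, 2]])
kept = top_variance_indices(toy, fraction=0.4)
reduced = toy[:, kept]
```

The same index set would be applied to every training and test pattern, so the reduced network sees only the retained dimensions.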


IV.  Experimental  results 


Through experiments on 990 patterns from the Korean Munjo character set, second order neural networks
reduced to 2*N input nodes using pixel combinations are shown to be further reducible by using PCA
without adversely affecting the recognition rate. The experiments compared the results of using a fixed percentage
of the second order features with the highest variance values. We have conducted experiments using the top 10%,
20%, 30%, 40%, 50%, 70%, and 100% of the second order features. Fig. 3 shows the resulting recognition rate,
learning time and recognition time.

First of all, using second order neural networks rather than first order neural networks shows a far higher
recognition rate, proving that the shift invariance problem has been overcome. Also, comparing the results of using
the top 20% with the results of using all second order features, we can see that using only the top 20% of the
features results in an 80% reduction in the number of input nodes and a 50% reduction in learning and
recognition time.
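
As a worked restatement of the node counts quoted in this paper (not an additional result), keeping the top 20% of the roughly 2N second order features gives:

```latex
0.2 \times 2N = \tfrac{2}{5} N, \qquad 1 - \frac{\tfrac{2}{5} N}{2N} = 0.8
```

That is, the reduced network has 2/5*N input nodes, which is the 80% reduction in input nodes relative to the 2*N-node network.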


Fig 3. Results of experiments using reduced second order neural networks (number of nodes and times for learning and recognition).

From the experimental results, we can see that the second order features obtained through pixel
combination still have redundant information that does not contribute to improving the recognition rate.
Eliminating such redundant information through PCA can lead to a further reduction in the number of input nodes
without adversely affecting the recognition rate while also improving the learning and recognition speed.


V. Conclusion

In this paper, we propose a method for reducing the number of second order neural network input
nodes through pixel combination and PCA, in order to solve the problems caused by the O(N^2) increase in the
number of input nodes: difficulties in implementation and increases in learning and recognition time. Using the
proposed method, we are able to implement a second order network with only 2/5*N input nodes. A reduction in the
learning and recognition time is also obtained.

Because geometric invariances are important for pattern recognition, much research is being
conducted on using higher order neural networks. But the combinatorial explosion of input nodes is the main obstacle to
this research, so research on solving this problem is steadily increasing. In the proposed method, we adopt a
statistical tool (PCA) to reduce the number of higher order neural network input nodes. So the proposed method
can be used regardless of the given pattern class.
Abstract

Temporal (spatiotemporal) sequences are a fundamental form of information both in natural and engineered
systems. The biological control process which directs the generation of iterative structures from undifferentiated
tissue is a type of temporal sequential process. A quantitative explanation of this temporal process is
reaction-diffusion, initially proposed by Turing [1952] and later widely studied and elaborated.

We have adapted the reaction-diffusion mechanism to create a novel network and algorithm based on a
chemical ‘neuron’ model, which performs storage, associative retrieval and prediction for temporal sequences.
Experiments demonstrate the ability of the device to achieve any desired depth, limited only by storage capacity,
to remember and predict on the basis of count to any length, and to learn an embedded Reber grammar to 98%
accuracy and permit retrieval with controllable redundancy.


1  Introduction 

A fundamental class of biological mechanisms is widely believed to control the growth of repetitive
structures such as insect leg segments, periodic patterns such as the stripes of a zebra, and similar sequences
which are largely but not exactly repetitive. The underlying biological process has been explained
quantitatively by the reaction-diffusion process consisting of a set of partial differential equations that describe
the space-time concentration of chemical morphogens responsible for stimulation of growth.
Reaction-diffusion is, in a word, a natural spatiotemporal sequential process which we wish to exploit.

On the engineering side of the ledger, the storage and retrieval of spatiotemporal sequences has
received a good deal of attention by reason of their fundamental place in the simulation of cognitive
processes. A few of the many proposals for TSP (temporal sequence processing) are time-delay neural
nets, recurrent multilayer feedforward nets [Elman 1990, Jordan 1987], gamma delay networks [deVries
& Principe 93], and the gaussian delay network, TEMPO 2 [Bodenhausen & Waibel 1991]. The basic
problem is to find a viable procedure for projecting the history of a sequence into the present so that
the past, back to some ‘depth’, can be made to influence the present or future. Of course, we desire the
device or network to be as fundamentally simple as possible and also biologically plausible.

In the following sections, we will (1) discuss the objectives of TSP and some previous efforts, (2)
explain how reaction-diffusion operates biologically, (3) define our TSP model, (4) discuss some
experimental tests of the model and (5) summarize the qualities and limitations of the Re-Di model.

2  Objectives  for  a  Temporal  Sequence  Processor 

The  two  distinct  ways  that  we  want  to  apply  the  TSP  to  process  sequences  are  reviewed  next. 

1. Embedded Sequence Recognition (ESR) - A number of short pattern sequences, PS =
(ps_1, ps_2, ...), are to be learned by the device. An unbounded argument sequence, ARGSEQ, is
compared to the set, PS, to determine if any one of the stored pattern sequences is embedded in
the ARGSEQ. The stored patterns could be meaning-bearing features of signals and the objective
is to identify the existence of features in a semi-infinite signal. The process of matching a specific
external sequence to internal stored states we will call guided sequence retrieval.

2. Sequence-Addressable Sequence Memory (SASM) - A ‘long’ sequence(s), ST, is stored.
Short address sequences are to be compared to ST for the purpose of locating regions of ST
which match the applied address sequence (‘address comparison’, a variant of guided sequence
retrieval). If a sufficiently exact match to the address sequence is found, we optionally want to
read the continuation of ST from the match point without further external guidance (free sequence
retrieval).

Upon close examination, storage, guided sequence retrieval, address comparison, and free sequence
retrieval all exert their special requirements on the engineering of the network. Note also that free
sequence retrieval is equivalent to prediction. For example, when the long stored sequence, ST, is a
musical melody or financial time series, free sequence retrieval amounts to using similar past behavior to
project the expected future.
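
The two objectives can be stated concretely with ordinary string matching. The sketch below only specifies the intended input/output behavior using exact matching on symbol strings; it says nothing about how the Re-Di network achieves it, and the function names `find_embedded` and `free_retrieval` are hypothetical.

```python
def find_embedded(patterns, argseq):
    """ESR: report which stored pattern sequences occur inside argseq.

    Returns a list of (pattern, start_index) pairs, one per pattern found.
    """
    hits = []
    for pattern in patterns:
        start = argseq.find(pattern)
        if start != -1:
            hits.append((pattern, start))
    return hits


def free_retrieval(stored, address, length):
    """SASM: locate address in the stored sequence and read its continuation.

    Returns the next `length` symbols after the match, or None if the
    address sequence does not occur in the stored sequence.
    """
    start = stored.find(address)
    if start == -1:
        return None
    end = start + len(address)
    return stored[end:end + length]


hits = find_embedded(["abc", "xyz"], "qqabcqq")
continuation = free_retrieval("cdddfgh", "ddf", 2)
```

Exact matching is the strictest case of the "sufficiently exact match" mentioned above; a tolerant matcher would replace the `find` calls.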

Class Flexibility - It is also highly desirable in many applications of TSPs that the system has
controllable tolerance to variations in the time of symbol occurrence.

In this sense, the stored sequences act as exemplars of classes, a useful condition often found in neural
nets.

We  propose  to  explain  how  the  Re-Di  TSP  deals  with  all  of  these  foregoing  objectives. 


3  Biological  Reaction-Diffusion 


A well-studied biological experiment consists of the surgical removal of an internal segment of a cockroach
tibia followed by regrafting of the distal and proximal parts as illustrated in Fig. 1 [Meinhardt 1982]. If
the original tibia consisted of a sequence of similar but not identical segments numbered 123456789,
and the segment, 4567, were removed, it is found that after one or two moults, the internal segment
sequence is regenerated, in this case, in its original order. This experiment implies the existence of
control information and a controlled growth process which stores and retrieves sequences. How is this
to be explained? The quantitative explanation was set forth some 40 years ago [Turing 52] as a set of
p.d.e.’s which are self-stabilized and which specify the growth and decay of morphogens stimulating the
regenerative growth of the segments. An example of reaction-diffusion equations is given.^1
(2)


g_i represents the concentration of the ith morphogen which excites the growth of the ith segment of the
tibia, r is the concentration of the common reactant chemical, and the coefficients are constants. The
D’s are the diffusion coefficients. Assume that segment 8 emits morphogen g_7. If r=1, ∂g_7/∂t will grow
by positive feedback for some time and the diffusion term (containing D_7) will diffuse the chemical into
the segment 8/segment 3 interface, stimulating growth of segment 7 material. At farther locations, the
reactant r, which is set to diffuse rapidly, will suppress growth of g_7. Thus the strongest stimulus for
growth of segment 7 will occur at the edge of the segment 8 material. Notice that reactant r is generated
in proportion to the sum of squares of morphogens present (eqn. 2, 1st term on right), and, by appearing
in the denominator of term 1 of eqn. 1, suppresses the growth of all morphogens.
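
For orientation, equations of the kind described in this section can be sketched in a standard activator-inhibitor form. This is an assumed form consistent with the verbal description, not a transcription of the authors' equations: the constants c, \alpha, \beta and the diffusion coefficients D_i, D_r are placeholder symbols, and any terms coupling neighbouring morphogens are omitted. See [Meinhardt 1982] for the complete equation set.

```latex
\frac{\partial g_i}{\partial t} = \frac{c\, g_i^2}{r} - \alpha\, g_i + D_i \frac{\partial^2 g_i}{\partial x^2}

\frac{\partial r}{\partial t} = c \sum_i g_i^2 - \beta\, r + D_r \frac{\partial^2 r}{\partial x^2}
```

In this sketch the first term of the first equation supplies the positive feedback and carries r in its denominator, while the first term of the second equation generates r in proportion to the sum of squares of the morphogens present.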

After segment 7 has grown, it emits g_6 which excites the regeneration of segment 6, etc. until the
surgically removed tibia segments regenerate (Fig. 1, rightmost sketch).

The reaction-diffusion equations permit calculation of the growth, diffusion and decay of each
morphogen in turn, which effectively carries the history of previous activity forward in time while distributing
the information spatially by diffusion. This is the natural biological mechanism which we shall imitate
to engineer a temporal sequence processor.


^1 These equations are somewhat simplified here, hence less than fully accurate, by reason of the limited space
available for explanation. See [Meinhardt 1982] for the complete equation set.
4 The Re-Di Temporal Sequence Processor

4.1  The  Basic  Cell 

A basic cell of the Re-Di processor or network has the following properties:

(1) The cell has a single label which specifies the chemical (morphogen) which it emits, i.e., the cell
labelled m emits g_m.

(2) When activated, the cell emits one unit of its chemical into the medium.

(3) The cell has K receptors. Each receptor specifies the chemical vector which will activate the cell.
The kth receptor has values V_k = (v_1, v_2, ..., v_N), where N is the maximum number of chemicals
and v_i corresponds to chemical g_i. A cell is activated when the chemical concentration vector,
G = (g_1, g_2, ...), at the location of the cell is sufficiently nearly equal to one of the K receptors (V
vectors) of the cell. More precisely, cell C is activated if

N 

^  k=l 

At  birth,  all  cell  labels  are  blank  and  all  V-vectors  equal  0. 
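
The cell properties listed above can be sketched as a small data structure. Two points in the sketch are assumptions rather than statements from the text: "sufficiently near" is taken to mean Euclidean distance within a tolerance sigma, and untrained receptors are modelled as absent rather than as zero vectors. The class name `BasicCell` and its methods are hypothetical.

```python
import math


class BasicCell:
    """A Re-Di basic cell with a label and up to K receptor vectors."""

    def __init__(self, num_receptors):
        self.label = None            # blank at birth
        self.num_receptors = num_receptors
        self.receptors = []          # trained V-vectors only

    def learn(self, g):
        """Train an unused receptor to the current chemical environment g."""
        if len(self.receptors) >= self.num_receptors:
            raise ValueError("no unused receptor left")
        self.receptors.append(list(g))

    def is_activated(self, g, sigma):
        """Assumed rule: some receptor lies within distance sigma of g."""
        return any(math.dist(g, v) <= sigma for v in self.receptors)


cell = BasicCell(num_receptors=2)
cell.label = "d"
cell.learn([0.5, 0.0])
near = cell.is_activated([0.45, 0.0], sigma=0.1)
far = cell.is_activated([0.25, 0.5], sigma=0.1)
```

A larger sigma makes the cell respond to a wider range of chemical environments, which is the tolerance control discussed in the text.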

4.2  The  Architecture  of  the  TSP 

A general TSP is composed of a number of basic cells distributed randomly in a volume suffused by a
medium which supports diffusion of the chemicals (see Fig. 2). In the present report, we will assume the
basic cells are distributed on an x,y plane at integer locations. In general, there may be any number of
cells with the same label but we will limit ourselves to one cell for each label, that is, there is one source
for each chemical. The network is capable of generating the reactant chemical, r(x,y), at any position
(x,y) where a basic cell exists. It is also assumed possible to locate the particular cell corresponding to
an external symbol.

The  growth,  decay  and  diffusion  of  chemicals  in  the  TSP  are  governed  by  the  following  equations. 
The  Reaction-Diffusion  (Re-Di)  Equations 


P=1 

Operation  of  the  TSP  is  best  described  with  an  example  which  follows. 


(3) 

(4) 


4.3 Storing a Sequence (Training the Processor)

Assume the processor is in its virgin state; we will follow the steps of storing the sequence s_1 s_2 s_3 s_4 s_5
= cdddf. The chemical concentration, G = 0, everywhere. We will assume a 5 x 5 array of cells located
at x = 1,...,5, y = 1,...,5.

Upon applying s_1 = c, one cell is recruited randomly and assigned the label, c. Suppose it is at x=3,
y=3. C(3,3) is labelled the active cell (AC), it emits one unit of g_c, and the Re-Di equations are applied,
diffusing g_c throughout the medium.

Upon applying s2 = d, another blank cell is recruited (say at x,y = 2,3); it is labelled d, flagged as
the AC, and one of its receptors (V-registers) is set or trained to the value of the chemical environment
at its location at that moment. Only chemical g_c will be non-zero. The AC then emits one unit of g_d and
the Re-Di equations are stepped to simulate growth or decay and diffusion of all chemicals in the system.

Upon applying s3 = d, the cell labelled 'd' is located, namely C(2,3), which continues as the AC. If
none of the V-registers of C(2,3) is within the σ tolerance zone of the current g(2,3), an unused V-register
of C(2,3) learns (is set to) g(2,3). Then one unit of g_d is emitted and the Re-Di equations are applied.

When s4 = d is applied, the procedure for s3 is repeated. If σ is sufficiently small, the fact of a different
chemical environment for each transition results in a unique activation vector for each transition. That
is, the 2nd d to the 3rd d is distinguished from the 3rd to the 4th d.

Finally, upon applying s5 = f, a new cell is recruited, labelled f, and set to recognise the g which carries
the history of the sequence. Eventually, the chemical concentrations decay to zero. The device generates
a chemical 'alphabet soup' capable of uniquely representing the historical context of every transition.

Retrieval  of  a  sequence  follows  the  same  procedure  as  the  storage  process  except  no  V-registers  are 
altered. 

Upon future reapplication of the same sequence for the purpose of recognising the stored sequence,
if the chemical concentrations are noiseless and applied with the same timing, the conditions stored in
the V-registers will be exactly recapitulated, causing the same cell activation sequence to reemerge.

Even if there is some noise or the timing varies, use of a wider tolerance window (greater σ) can still
permit the stored sequence to be recognized.
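
The storage and retrieval rules above can be sketched in a few lines of Python. This is only a minimal sketch, not the authors' implementation: the class name `Cell`, the use of the largest absolute component difference as the closeness measure, and the example concentration vectors are all assumptions made for illustration, since the exact activation condition is not reproduced in this text.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A basic cell with a label and a list of V-registers (assumed structure)."""
    label: str
    v_registers: list = field(default_factory=list)

    def matches(self, g, sigma):
        """True if some V-register lies within tolerance sigma of g.

        Closeness is the largest absolute component difference; this
        measure is an assumption, not the paper's activation condition.
        """
        return any(
            max(abs(a - b) for a, b in zip(v, g)) <= sigma
            for v in self.v_registers
        )

    def store(self, g, sigma):
        """Training: learn g in a new V-register only if no register matches."""
        if not self.matches(g, sigma):
            self.v_registers.append(tuple(g))
            return True
        return False


cell = Cell("d")
cell.store((0.30, 0.00), sigma=0.01)   # first transition into d
cell.store((0.20, 0.25), sigma=0.01)   # later transition, different context
cell.store((0.201, 0.25), sigma=0.01)  # within tolerance: no new register
```

Retrieval would call `matches` only, leaving the registers unaltered, and a wider tolerance at retrieval simply means passing a larger `sigma`.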

5  Examples 

5.1  Count  (or  Depth)  Test 

The ability of the network to count can be tested directly by demonstrating that the sequence cd^n f
can be distinguished from cd^(n+1) f, that is, the occurrence of n d's is distinct from n+1 d's. The
test consists of storing a sequence such as cddddddd... and noting the point where successive d's are
no longer distinguished by utilizing a new V-register. This experiment was performed with various values of
tolerance, σ. The results are shown below.


COUNTING  TEST 

Metaphorically, larger values of σ correspond to paying less attention to the precise count. As the count
or depth increases, the penalty is a proportionate increase in the required storage capacity, which can be
defined as: Capacity = N_c N_V, where N_c = the number of cells and N_V = the average number of occupied
V-registers (activation condition registers) per cell.
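
The trade-off between tolerance and countable depth can be illustrated with a toy model. The sketch below does not use the Re-Di equations; it assumes, purely for illustration, that the concentration seen by the d cell decays by a fixed factor and then gains one emitted unit at each symbol. The function name `count_depth` and the parameter `decay` are hypothetical.

```python
def count_depth(sigma, decay=0.5, max_symbols=50):
    """Number of successive d's that still claim a new V-register.

    Assumed toy dynamics (not the paper's Re-Di equations): at each
    symbol the concentration decays by `decay`, then one unit is emitted.
    """
    registers = []
    g = 0.0
    for _ in range(max_symbols):
        if all(abs(g - v) > sigma for v in registers):
            registers.append(g)  # this d is distinguishable: learn it
        else:
            break  # this d is no longer distinguished from an earlier one
        g = decay * g + 1.0
    return len(registers)


depth_coarse = count_depth(sigma=0.1)
depth_fine = count_depth(sigma=0.01)
```

Under these assumptions a smaller σ distinguishes more successive d's, at the price of more occupied V-registers, which is the qualitative behaviour described above.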

5.2  Embedded  Reber  Grammar 

An embedded Reber grammar was generated with all transition probabilities = 0.5. The network was
trained  with  500  strings  and  tested  with  500  different  strings.  Sequence  length  was  unrestricted;  string 
lengths  as  long  as  29  symbols  were  observed.  We  used  a  hard-line  accuracy  requirement,  namely,  every 
transition  in  a  string  must  be  predicted  by  the  network  in  order  to  score  a  Pass.  Usually  prediction  of 
the  penultimate  symbol  (t  or  p)  by  correlation  to  the  second  symbol  (t  or  p)  is  the  goal. 

Variation of the activation tolerance measure, σ, demonstrates the range of generalization possible
during storage of sequences. In the limit as σ → 0, every transition was learned uniquely, which required
about 600 V-registers to record an average 4683 transitions. Every sequence which has ever been presented
is uniquely retrievable when σ = 0 during storage. When σ increases, some distinct strings may be stored
as equivalent but fewer V-registers are used (e.g., about 300 V-registers when σ = 0.005). Ultimately,
when σ = 0.5, the representation became context-free, that is, no history was stored: any transition
between Reber grammar symbols was permitted. Similarly, during the Test phase, the acceptable tolerance
for each transition, σ_TST, was varied. With σ = 0.001 and σ_TST = 0.005, 98.4% of strings were recognized.
Accuracy for various combinations of storage and test tolerance is summarized in Fig. 3.

5.3  Time  Flexibility 

In  the  foregoing  cases,  retrieval  was  performed  with  the  same  timing  used  during  storage.  If  retrieval 
is performed at a rate slower or faster than the rate during storage, the chemical concentrations will be
different. In this experiment, the storage (training) algorithm was performed with a unit time between
each  symbol  by  using  a  unit  step  of  decay  and  diffusion.  To  simulate  retrieval  at  a  slower  rate  than 
storage,  we  performed  two  or  three  steps  of  the  Re-Di  equations  between  symbols.  In  Fig.  4,  the  error 
rate  is  the  percentage  of  sequences  in  which  at  least  one  transition  was  not  correctly  signalled.  The 
error rate can be reduced by increasing the activation tolerance, σ_TST. Almost any sequence could be
retrieved with σ_TST = 0.2. Increasing σ_TST permits an increasing number of transitions to be seen as
permissible, even some which may not have been explicitly seen during the storage operation. Thus, the
cost of retrieving at a rate different from the rate used during storage is that one must accept a higher
level of uncertainty that the object sequence was, in fact, precisely the one that was stored. On the other
hand, temporal spacing in the stored sequence can be distinguished in the retrieval process, if desired,
by using both a small σ and a small σ_TST at the cost of storage capacity.

5.4  Sequence  Addressable  Sequence  Memory 

SASM  entails  the  storage  of  comparatively  ’long’  strings  addressed  by  shorter  sequences.  The  storage 
algorithm  is  independent  of  the  lengths  of  sequences  being  stored. 

An  experiment  was  designed  to  test  the  ability  of  the  Re-Di  TSP  to  perform  in  SASM  style.  Six 
simple  musical  melodies  averaging  about  20  notes  duration  were  encoded  in  a  form  which  represents 
both  the  note  and  its  duration.^  All  of  the  melodies  had  a  common  internal  subsequence,  the  objective 
being  in  part  to  demonstrate  that  the  sequences  could  still  be  completely  distinguished  downstream  from 
the  common  point. 

After  storing  the  melodies,  a  brief  (3  to  6  note)  initial  string  from  an  arbitrary  stored  string  was 
used  as  the  address.  The  address  was  presented  in  Guided  Sequence  Retrieval  mode.  At  that  point, 
having  developed  the  chemical  distributions  corresponding  to  the  initial  part  of  one  of  the  stored  strings, 
the  system  reverted  to  Free  Sequence  Retrieval  meaning  that  each  successive  activated  node  was  decided 
competitively by selecting the unique (usually) node with the largest activation. In all cases except one
special  condition,  the  expected  sequence  was  correctly  retrieved  all  the  way  to  its  terminal  symbol. 

6  Summary  and  Conclusions 

Temporal  sequence  processing  fundamentally  requires  memory  of  the  history  of  the  sequence,  usually 
supplied by a delay kernel. In the current paper, we have explained how history can be carried by a
mechanism analogous to chemicals growing, decaying and diffusing according to the reaction-diffusion
process. This obviates the need to propose ad hoc shapes or specifications for the delay kernel. The
proposed system provides full control of depth during storage by a single parameter, σ. Increasing
depth increases required storage capacity; decreased depth amounts to expanding class generalization
(decreased resolution) of sequences during storage. Another similar parameter, σ_TST, can be set to
various degrees of tolerance of transitions during retrieval. The system is capable of counting to any
magnitude (i.e., has arbitrary depth). Large depth carries the cost of larger memory capacity, however.
The Re-Di system attained an accuracy of 98% in the severe test posed by an unrestricted-length
embedded Reber grammar. The system is also capable of reading back at a rate different from that
during storage but at a cost of increasing error tolerance as the difference between storage and retrieval
rates increases. It is also possible to store and retrieve not only the spatial vector sequence but also the
duration of each vector.

^Unfortunately, space does not permit the full explanation of the encoding and the experiment details. A full
description is in preparation.


Abstract

In this paper we present a computer language that allows you to create neural nets, not by
training them but by programming them.

The language constructs allow you to program at a high level, the schema level. It actually looks a
little like Prolog. However, it is based on a theory which is at odds with the idea of symbolic logic and
predicate calculus.

The idea was born out of the realisation that from a software engineering point of view, the best
way of creating a perceptual recognition network may be to program it rather than to coax a learning
algorithm into learning examples and generalising appropriately.

The  language  is  designed  for  pattern  recognition  tasks,  e.g.  character  /  speech  /  face  recognition. 
A  digit  recogniser  is  presented  in  the  paper  to  illustrate  the  general  principles. 

The Connectarine Development Environment is implemented in C under MS-DOS.


Introduction 

Connectarine  is  a  language  that  allows  you  to  explicitly  program  neural-nets  at  the  schema  level. 
The program consists mainly of a set of productions that have a similar syntax to Prolog, which interact
to generate all the neurons and all the connections in the network.

The  language  was  originally  conceived  from  the  realisation  that  from  a  software  development 
point of view, the best way of producing a character recogniser or speech recogniser may be to connect
one up manually, based on a theory of feature detectors, rather than to use a learning algorithm.

In addition, the resulting neural nets are editable, their operation is comprehensible (they are not
'black boxes' like most other neural nets) and they are very efficient (they are sparsely connected and the
neurons are systematically created). They may even exhibit better performance, because of the element
of design in their construction.

A Connectarine program takes a set of input neurons and a set of 'productions'. It uses these to
generate  more  neurons,  and  these  neurons  to  generate  further  neurons,  until  the  productions  don’t 
generate  any  more  neurons.  The  weights  are  generated  at  the  same  time  as  the  neurons. 
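
The generate-until-exhausted behaviour described above amounts to a fixed-point iteration. The following Python sketch shows the idea under simplifying assumptions: neurons are plain tuples, each production is a function from one neuron to a set of new neurons, and the `halve` production is a hypothetical stand-in rather than real Connectarine code.

```python
def generate(inputs, productions):
    """Apply productions repeatedly until no new neurons appear."""
    neurons = set(inputs)
    frontier = set(inputs)
    while frontier:
        created = set()
        for produce in productions:
            for neuron in frontier:
                created |= produce(neuron)
        frontier = created - neurons  # keep only genuinely new neurons
        neurons |= frontier
    return neurons


def halve(neuron):
    """Hypothetical production: feature(level+1, x/2) :- feature(level, x)."""
    name, level, x = neuron
    return {(name, level + 1, x // 2)} if level < 3 else set()


inputs = {("feature", 0, x) for x in range(8)}
net = generate(inputs, [halve])
```

Here eight input neurons give rise to four, then two, then one neuron at successive levels, after which the production yields nothing new and generation stops.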

The author believes that neural nets are the most natural and efficient way to do fuzzy
recognition  tasks,  whether  they  are  trained  or  programmed. 


Introduction to Connectarine

The basic unit in Connectarine is the neuron. A neuron is specified as an identifier with various
parameters. The productions use variables and expressions in place of the parameters to define new
neurons  and  their  connections  and  weights. 

The language allows you to create layered networks, general feed-forward networks, feedback
networks, and it even allows you to define infinite networks.

In order to convey the concepts behind the Connectarine language, let's look at a skeleton piece of
Connectarine code:

« input: on(1..12, 1..19) »

edgelet("-",x,y) :- on(x-1,y), on(x,y), on(x+1,y),
                    -on(x,y-1), -on(x,y+1).

... etc ...

« output: digit(0..9) »


The  line  at  the  beginning  and  the  one  at  the  end  interface  a  Connectarine  program  with  the  outside  world. 
They  define  the  input  neurons  and  output  neurons.  In  the  middle  there  is  a  list  of  productions. 

In  this  case,  'on(x,y)'  are  the  input  neurons.  We  also  say  that  they  represent  the  activation  values 
of  those  neurons.  For  example,  'on(3,4)’  specifies  the  activation  value  of  position  (3,4)  on  the  bit-map. 

The  neurons  are  feature  detectors.  We  often  call  a  neuron  a  'concept  detector',  because  they  are 
used to represent concepts in the more general sense - visual features, complex visual objects, semantic
objects,  etc.  (We  take  the  unorthodox  view  that  'meanings'  can  be  assigned  to  neurons  in  this  way  in  the 
computer  as  well  as  the  brain.  However,  our  reasons  and  our  qualifications  are  not  discussed  here). 

For example, 'edgelet("-",x,y)' represents a short horizontal edge at position (x,y). Neuron
expressions, where variables are substituted for the parameters, are used in the productions. A production
is something of the form:

<neuron-expr> :- <list of neuron expressions>.

Connectarine starts with the input neurons. It tries to fit them into all possible places in the right-hand
side of a production. This 'fitting in' involves assigning values to the respective variables. These values
then allow us to evaluate the parameters in the left-hand side and create a neuron with those parameters.

The Connectarine system calculates the activation level of a neuron in the usual way, by
summing the activations of its input neurons (those on the right-hand side) and passing the result through
a sigmoidal response function. The programmer has control over the steepness and the threshold of the
sigmoid function (the 'threshold' referring to the x-position of the inflection point).
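
As an illustration of how a single production might be fitted and evaluated, here is a minimal Python sketch of the horizontal edgelet rule. It is not the Connectarine interpreter: the tiny bitmap, the `(dx, dy, weight)` encoding of the rule, and the particular threshold and steepness values are all assumptions chosen for the example.

```python
import math

# Input neurons: on(x, y) activations from a tiny bitmap (assumed example).
bitmap = ["....",
          ".###",
          "...."]
on = {(x, y): 1.0
      for y, row in enumerate(bitmap)
      for x, ch in enumerate(row) if ch == "#"}

# edgelet("-",x,y) :- on(x-1,y), on(x,y), on(x+1,y), -on(x,y-1), -on(x,y+1).
# Each right-hand-side term is encoded as (dx, dy, weight).
RULE = [(-1, 0, 1.0), (0, 0, 1.0), (1, 0, 1.0), (0, -1, -1.0), (0, 1, -1.0)]


def sigmoid(total, threshold=2.0, steepness=4.0):
    """Sigmoidal response; threshold is the x-position of the inflection."""
    return 1.0 / (1.0 + math.exp(-steepness * (total - threshold)))


def fire(rule, inputs):
    """Create one head neuron for every binding that an input neuron fits."""
    heads = set()
    for (ix, iy) in inputs:
        for dx, dy, _ in rule:
            heads.add((ix - dx, iy - dy))  # solve x+dx == ix, y+dy == iy
    # Neurons missing from the database count as activation zero.
    return {(x, y): sigmoid(sum(w * inputs.get((x + dx, y + dy), 0.0)
                                for dx, dy, w in rule))
            for (x, y) in heads}


edgelets = fire(RULE, on)
```

The edgelet centred on the middle of the three lit pixels receives the strongest activation, positions at the ends of the stroke respond more weakly, and positions directly above or below the stroke are suppressed by the negative terms.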

Connectarine  also  supports  fuzzy  logic,  however  it  is  the  author's  view  that  connectionist  logic, 
where  values  are  summed  and  sigmoided  and  two  half-true  statements  can  add  to  each  other,  is  more 
effective and natural in this domain than fuzzy logic. For example, for these purposes it is better that (0.5
or 0.5) > (0.5 or 0), than (0.5 or 0.5) = (0.5 or 0).
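
The difference can be made concrete with a small Python comparison. This is a sketch under assumptions: fuzzy OR is taken to be the usual maximum, and the threshold and steepness given to the connectionist version are arbitrary illustrative values.

```python
import math


def fuzzy_or(a, b):
    """Standard fuzzy OR: the maximum of the two truth values."""
    return max(a, b)


def connectionist_or(a, b, threshold=0.5, steepness=4.0):
    """Sum the inputs and pass the result through a sigmoid.

    The threshold and steepness values are assumptions for illustration.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (a + b - threshold)))


fuzzy_both = fuzzy_or(0.5, 0.5)
fuzzy_one = fuzzy_or(0.5, 0.0)
conn_both = connectionist_or(0.5, 0.5)
conn_one = connectionist_or(0.5, 0.0)
```

With fuzzy OR the two cases are indistinguishable, whereas with the summed-and-sigmoided form two half-true inputs reinforce each other and give a higher activation than one.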

We  assume  that  any  neuron  not  in  the  database  of  neurons  has  an  activation  of  zero.  (Actually, 
this  is  an  oversimplification  of  the  algorithm,  but  it  is  generally  true). 


Programming  a  Digit-Recogniser  with  Connectarine 

The  input  to  a  character  recogniser  is  a  matrix  of  bit-map  pixel  activations.  By  creating  a  level 
of  edge  detectors  as  illustrated  in  the  last  section,  we  can  make  a  start  on  the  full  network. 

These edgelets get put together into 'cornerlets' (little corners) and 'curvelets' (little curves) and
'edges' (medium-sized edges). There are also detectors for line-endings.

These features get built together to form medium-sized corner-detectors and curve-detectors of
every possible orientation, and so on. Eventually we get up to more complex features such as 'the top of
a 5' and 'a J-type curve', and finally to '2' and '8' and so on, which are the output neurons.

At each stage, we reduce the resolution of the image by combining together neurons in the same
areas of the bit-map. At each stage we get detectors for more complex features, with larger receptive
fields. Ultimately, we get full digit-detectors whose receptive field is the entire bit-map, and these form
the output layer. The final neural net has about 11 layers.

Here  are  extracts  from  a  digit-detector  program: 

« Sigmoid(70%,6) »

edgelet("|",x,y) :- on(x,y-1), on(x,y), on(x,y+1), -on(x-1,y), -on(x+1,y).
edgelet("\",x,y) :- 2*on(x-1,y-1), 2*on(x,y), 2*on(x+1,y+1), -on(x,y+1),
                    -on(x-1,y), -on(x+1,y), -on(x,y-1).

« Sigmoid(35%,8) »

cornerlet("|_",x/2,y/2) :- edgelet("|",x,y+1), edgelet("|",x+1,y+1),
                           edgelet("-",x+2,y-1), edgelet("-",x+2,y).
curvelet("\_",x/2,y/2) :- edgelet("\",x-1,y+1), edgelet("_",x+1,y).
curve("\_/",x/2,y/2) :- fuzzy_curvelet("\_",x-1,y), fuzzy_curvelet("_/",x+1,y).

{ --- character-specific neurons --- }

line7 :- edge("/",x,y), edge("/",x,y+1).

bottom5 :- curve("\_/",x,y), fuzzy_curve(")",x,y+2).

{ --- Output neurons --- }

digit("2") :- top2("2",x,y), cusp2(x,y-1).

digit("5") :- side5(x,y), top5(x,y), curve("5",x2,y2).

In  all  things,  we  do  not  regard  features  as  either  being  there  or  not.  They  just  have  a  higher  or  lower 
activation  value.  For  example,  the  feature  detector  for  a  cross  will  get  activated  by  the  tiny  serif  at  the 
bottom of a 'T'. Even these very small activations play a part in the overall process. They might represent
some feature which should be there, but because of bad printing or bad handwriting isn't properly there.
They  can  still  affect  the  overall  outcome. 

This  network  will  recognise  a  character  no  matter  where  on  the  bit-map  it  is  situated.  It 
recognises  characters  of  different  sizes,  within  upper  and  lower  limits;  this  is  a  consequence  of  rules  such 
as: 

curve("\_/",x,y) :- curve("\_",x-1,y), curve("\_",x,y), curve("_/",x+1,y), curve("_/",x+2,y).

where  the  sigmoid  response  function  is  tuned  to  respond  when  just  some  of  these  are  activated.  When 
every  level  of  the  network  has  rules  such  as  this,  it  causes  recognition  to  occur  regardless  of  size,  except 
above  a  certain  size  and  below  a  certain  size. 

It  will  also  cope  with  a  certain  degree  of  rotation,  as  an  intrinsic  feature  of  having  feature 
detectors that accept distortions. In fact, you can look at what happens as you rotate the digit: the
activation strength of the output neuron representing that digit is at a maximum when the character is
near its normal orientation. As you rotate it further, the activation strength drops off, until there is no
recognition (activation ≈ 0).


A  Digit-recogniser  in  perspective: 

This  program  recognises  single  digits  from  bit-maps.  It  takes  the  image  as  an  input  vector,  and 
has  10  output  neurons  corresponding  to  the  10  digits.  All  the  neurons  in  the  middle  represent  various 
features  at  various  positions.  For  example,  if  you  do  a  query  to  look  at  the  activation  of  the  horizontal 
edge-detectors  on 


you 


The  activation  of  each  of  the  10  output  neurons  corresponds  roughly  to  the  probability  it  is  that  digit.  You 
can either look for the highest activation, or look for the highest activation but require that it be
significantly higher than the next highest, or just take all the output neurons and feed them into a
higher-level network.
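
The first two read-out strategies can be sketched as a single Python function. The function name `classify`, the `margin` parameter, and the example activation values are assumptions for illustration only.

```python
def classify(activations, margin=0.0):
    """Return the label with the highest activation.

    If the winner does not beat the runner-up by at least `margin`,
    return None to signal that no confident decision was reached.
    """
    ranked = sorted(activations.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    return best if top - second >= margin else None


# Hypothetical output-neuron activations for one bitmap.
outputs = {0: 0.05, 1: 0.10, 2: 0.85, 7: 0.60}
plain_choice = classify(outputs)
strict_choice = classify(outputs, margin=0.3)
```

With no margin the digit 2 is chosen outright; with a margin of 0.3 the same activations are rejected because the runner-up, 7, is too close.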

If  you  want  to  put  it  into  the  context  of  something  like  a  post-code  recogniser,  you  might  need 
digit-detectors  with  position  parameters  so  that  at  each  position  of  the  bit-map  (at  a  low  resolution, 
however),  you  have  neurons  detecting  each  digit.  You  should  get  something  like  this: 

1264-1

2-detectors:  -== . 

1 -detectors:  —  -=-  -  - 

5-detectors:  -  -  -===-  - 

4-detectors: . ===- 

There is interest at the moment in recognising characters from stroke information. In other
words, you get a full description of the trajectory of a pen rather than a bit-map representation of a
character. This is a very important component of the new pen-based notebook computers. This is
no  problem  for  a  character  recogniser  of  the  type  described  above.  You  just  get  rid  of  the  first  layer,  and 
have  the  edgelet  layer  as  the  input  layer.  So  the  stroke-information  is  translated  directly  into  activations 
of  each  of  these  edgelets,  by  software  outside  the  net,  and  then  exactly  the  same  network  can  be  used  to 
do  the  character  recognition. 

This neural net recognises characters based on the juxtapositions of the various features (not by
merely looking at sets of features, e.g. "it has a cross and a horizontal concavity"). If the goal of a neural
net is to recognise words, this idea can also be taken further by allowing the neural net to look at the
characters surrounding a given character in order to help categorise it. In other words, the various
activation values for potential characters feed into a 'frequent English syllable recogniser', which then
outputs a single identifier or goes back to the letter level to output the individual characters.

If we go up to the next level, word-root or word recognition, we can even recognise misspelt
words, just as we recognise distorted characters.

According  to  the  author's  theory,  tasks  such  as  text  recognition  and  speech  recognition  should  be 
done  in  this  way,  rather  than  by  a  process  of  hypothesis  testing.  Each  level  outputs  a  set  of  possibilities 
with various activation values into the next level, and 1 or more neurons in the next level will fire. In
other  words,  a  feed-forward  neural  net  is  sufficient  for  these  tasks.  The  fact  that  humans  can  recognise 
handwritten  text  pre-attentively  supports  this  theory. 

Of  course,  the  language  also  supports  recurrent  networks,  so  a  hypothesis-testing  model  could 
also  be  explored. 


Results: 

A Connectarine interpreter and development environment has been implemented in C under
MS-DOS. A Connectarine compiler (compiling Connectarine to C or object code) is planned.

The digit-recogniser described in this paper has been implemented. It has not been tested on any
public database of character bitmaps; however, it is 99% accurate on one set of 100 12x19 digit bitmaps.
The first 50 digit bitmaps were used to develop the program, and the remaining ones were used to test the
program with no further development. This full database is reproduced below, so the reader can verify
that the digits represent a variety of possible digit bit-maps and that the recogniser can cope with distortion
and noise (as well as translations and differences in size).

This program consists of 550 lines of Connectarine code (including comments etc). The
development of the digit recogniser was a non-trivial task, and one requiring the author to acquire a
certain amount of expertise in the methodology. The program is far from perfect, but also far from
plateauing out. As the 3 days of programming progressed, recognition rates improved steadily as bitmap
quality decreased.




[Figure: the database of digit bitmaps, shown as rows of the digits 0 to 9.]


In other words, given the Connectarine development environment, a general tool for all fuzzy
recognition tasks, a fully functional optical digit recogniser was created after just 3 days' work.


Relation  to  Other  Work: 

This is the only project we know of which uses the idea of explicit programming of neurons and
feature detectors, or implementing production rules as neurons. There has been a lot of work done on
hybrid connectionist expert/production systems, [Gal88] and [Kas90]. The need for a toolbox of
alternative methods is expressed in [Kan92]. Connectarine could be viewed as a forward-chaining Prolog
with connectionist logic (or fuzzy logic, which is also supported) and with arithmetical unification.
However, the issues raised by the language are not related to traditional logic programming.

Fukushima's 'Neocognitron' deals with feature detectors and pattern recognition in a similar way
[Fuk88a], [Fuk88b]. This project involved the creation of a deep neural network capable of recognising
hand-written digits in a translation-, size- and distortion-independent way. In this case, the network was
created by providing it with training sets of intermediate features at several levels, and having it learn to
recognise these given the processing from the lower levels. There was an explicit architecture of
alternating generalisation and categorisation levels, which is in a sense how this digit recogniser works.

Other approaches to handwritten character recognition are given in [Jak88] and [Nak92]. In
these systems, a single layer of preprogrammed feature detectors is created by traditional programming
methods and then fed into a neural net. This first layer would presumably take longer to develop without
a language such as Connectarine than in Connectarine. Also, the performance of the neural net from that
point upward might suffer from being a shallow net, when the problem intrinsically requires a deep
neural net to avoid combinatorial explosions.

There are programs which allow you to create neural nets by wiring up and calibrating
individual neurons or tracts of neurons, for example MacBrain for the Macintosh, but these are not
suitable for serious use on a neuron-by-neuron basis. These systems, and numerous other systems, allow
one to configure broad connectivity patterns, but the idea there is to work with learning rules which do all
the real work.


* * * * *


There are many opportunities to combine this kind of approach with other approaches to neural
nets. A network like this could be combined with trained neural nets by concatenating neural nets along
layers. For example, Connectarine might be better at doing the lower levels or higher levels of a
perceptual network. Conversely, Connectarine could benefit from training algorithms in certain ways.
Connectarine could also be used to define a good starting point for a learning algorithm, because it
creates networks that are already close to some kind of local or global minimum.

There are not many techniques for creating deep neural nets. Connectarine is one such
technique. Deep neural nets are appropriate for tasks such as this, because the variety of transformations
that recognition is supposed to be invariant under would otherwise lead to a combinatorial explosion.
The Neocognitron is one such system, but this requires the collation of large sets of intermediate features.

Connectarine could also be applied to other categorisation problems, the type expert systems are
more commonly used for. It has the advantage that its operation is not opaque: it is possible to
analyse the network and to validate that it will not generate wildly inappropriate responses.

Connectarine has enabled the author to study the theory of perception in a neural net with a view
to explaining, although not mimicking, the human perceptual cortex.

However,  the  system  in  itself  is  primarily  intended  as  a  serious  tool  for  performing  fuzzy  pattern 
recognition.  The  next  development  will  be  to  apply  it  to  another  domain,  such  as  speech  recognition  or 
recognition  of  pathologies  in  cytology  slides. 


Abstract

This paper describes a neural network model and system implementation for recognizing
characters extracted from license plates of moving vehicles. A method for selecting features
appropriate for recognition is proposed based on relationships between character features and
recognition rate, and an enhanced back-propagation algorithm is also proposed which effectively
selects training patterns and dynamically modifies the learning rate. Based on the proposed algorithm, a
character recognition system for license plates is developed and tested against real data. In the test
performed on vehicles running on the roads, the system demonstrated a recognition rate higher than 95
percent.


1.  Introduction 

Neural networks have been successfully employed in various pattern recognition applications. In particular,
neural networks have demonstrated good performance in recognizing noise-stained and hand-written
characters, and thus a neural network implementation might be appropriate for recognizing characters extracted from
moving vehicles[5, 6, 7]. Real-time recognition of characters on vehicle license plates is very difficult, as the size of
characters varies depending upon the position of the image extraction, motion of the camera, and speed of the
vehicle. The required complexity of the system is very close to that of recognizing hand-written characters.

Research on developing neural network-based character recognition systems has mainly used features
extracted based on heuristics. However, the feature extraction methods currently in use have not proved their
validity through systematic analysis, and the features tend to lose their distinctiveness because of the uncertainty
involved in feature extraction and overlapping features. In addition, the number of features is so large that
recognition requires enormous computation time, which makes it almost impossible to implement on current
hardware.

Back-propagation is known to be useful in training multi-layered neural networks. However, the
disadvantages of the algorithm are that it requires a large computational time for training, may converge to a
local minimum, and forgets previously learned weights in the process of training[3, 4, 9].

In this research, a back-propagation algorithm is employed to recognize characters in the license plates of
moving vehicles in real time. An input-node removal method is proposed as an effective way of extracting features from
the object, and an enhanced back-propagation algorithm based on selective training patterns is employed to prevent
oscillation and to facilitate fast convergence. The proposed methods were successfully employed in recognizing
characters in the license plates of moving vehicles.


2.  Neural  Network  Design  for  Character  Recognition 

A character recognition system based on neural networks consists of four sequential steps: preprocessing, feature
extraction, training, and recognition. The preprocessing step comprises segmentation, noise filtering, and
normalization[7, 9]. Feature extraction is the process of determining how to explicitly describe the character pattern,
generally by constructing feature vectors representing characters, and thus has a strong influence on the overall
recognition rate. Improperly selected features frequently lead to a low recognition rate and require complex
recognition algorithms[1, 7, 9].

Analysis of the relationship between features and overall recognition rate in the back-propagation algorithm can
lead to performance improvement of the recognition system by selecting relevant features. As the features propagate
their influence up to the output nodes, a direct relationship between input features and recognition rate can be analyzed
by activating the neural network, purposefully removing relevant input nodes, and then measuring the recognition rate.
Although an attempt has been made to measure the influence of input features by analyzing distributions of weights
in hidden layers[2], the interpretation of weight vectors is very difficult.

Removal of irrelevant input nodes can be done by setting the values of those input nodes to zero, which in turn
sets their terms in the weighted summation of the hidden nodes, $net_j$ ($= \sum_i w_{ji} x_i$), to zero and thus leads to no contribution to the output
values. Accordingly, we propose that the influence of extracted features on recognizing characters can be determined by
analyzing the relationship between the removed input node and the recognition rate, so that system performance
can be improved by effective feature selection.
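
The input-node removal analysis described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the small sigmoid network, the weight layout, and the names `forward`, `recognition_rate`, and `removal_impact` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass; w_hidden[j] and w_out[k] hold the weights feeding node j / k."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]

def recognition_rate(patterns, labels, w_hidden, w_out, removed=()):
    """Fraction of patterns classified correctly with the given input nodes zeroed."""
    correct = 0
    for x, label in zip(patterns, labels):
        # Removing an input node is emulated by forcing its value to zero.
        masked = [0.0 if i in removed else xi for i, xi in enumerate(x)]
        output = forward(masked, w_hidden, w_out)
        if output.index(max(output)) == label:
            correct += 1
    return correct / len(patterns)

def removal_impact(patterns, labels, w_hidden, w_out):
    """Drop in recognition rate when each input node is removed on its own."""
    base = recognition_rate(patterns, labels, w_hidden, w_out)
    return [base - recognition_rate(patterns, labels, w_hidden, w_out, removed=(i,))
            for i in range(len(patterns[0]))]
```

Input nodes whose entry in the returned list is close to zero are candidates for removal; passing several indices in `removed` checks the effect of removing them together.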

--- Box-1: Enhanced Back-propagation Algorithm ---

Step I: Selective Learning

current_tss = 0.0;

Loop number_of_input for each pattern
    compute_actual_output();
    compute_error();

    current_tss = current_tss + current_error;

    If current_error > average_error Then
        adjust_weights();
    end if
end_of_loop

Step II: Adapt Learning Rate

average_error_t = current_tss / number_of_input;
delta_error = average_error_t - average_error_(t-1);
tolerance = - number_of_output * number_of_input / E6;

If delta_error > tolerance Then
    oscillation = oscillation + 1;
end if

oscillation_criterion = epoch mod number_of_input;

If oscillation_criterion = 0 Then
    learning_rate = initial_learning_rate * oscillation / number_of_input;
    oscillation = 0;
end if


In the back-propagation algorithm, dynamic modification of the learning rate and effective selection of training
patterns can improve system performance. Effective selection of training patterns can be done with the following
operations: divide the total error sum obtained in the forward pass by the number of training patterns to calculate an
average error, and then, in the backward pass, train only the patterns whose error is larger than the average error. This process
is described in Step I of Box-1. In general, training on a partial set of input patterns can lead to oscillation, which
can be remedied by dynamic modification of the learning rate. Dynamic modification of the learning rate can be implemented
by counting the number of oscillations, i.e. the times the error increases beyond a predetermined range, and reflecting that
ratio in setting the learning rate, thereby decreasing the learning rate in case of oscillation while increasing
it in case of convergence, to obtain faster convergence of the training. Box-1 describes the enhanced
back-propagation learning algorithm.
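
The selective learning of Step I can be sketched as below. This is an illustration under simplifying assumptions rather than the algorithm of Box-1 itself: a single linear unit trained with the delta rule stands in for a full back-propagation network, the learning-rate adaptation of Step II is omitted, and the function name `selective_epoch` and its parameters are hypothetical.

```python
def selective_epoch(patterns, targets, weights, learning_rate, average_error):
    """Run one epoch, updating weights only for patterns whose error exceeds
    average_error (the average error of the previous epoch).

    Returns (average error of this epoch, number of patterns actually trained).
    """
    total_error = 0.0
    trained = 0
    for x, t in zip(patterns, targets):
        y = sum(w * xi for w, xi in zip(weights, x))      # forward pass
        diff = t - y
        error = 0.5 * diff * diff
        total_error += error
        if error > average_error:                          # selective learning
            for i, xi in enumerate(x):
                weights[i] += learning_rate * diff * xi    # delta-rule update (in place)
            trained += 1
    return total_error / len(patterns), trained
```

Starting with `average_error = 0.0` trains every pattern with a nonzero error in the first epoch; in later epochs the value returned by the previous call is passed back in, so fewer patterns need a backward pass.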

In the enhanced learning algorithm, a tolerance value, in addition to the initial learning rate and momentum, must
be determined to measure the error increase for detecting oscillation. Also, the period for modifying the learning rate should be
determined. In the algorithm, the learning rate is modified each time the number of elapsed epochs reaches a multiple
of the number of training patterns, as described in Step II of Box-1. Selective training patterns might reduce
computational complexity in each epoch, and dynamic modification of the learning rate might effectively prevent
the oscillation frequently encountered in the back-propagation algorithm. Both mechanisms might improve convergence
speed and degree of generalization. In applications to real-world problems, the enhanced back-propagation
algorithm effectively solved a set of complex training patterns. For easy training patterns, the number of
epochs in the enhanced algorithm tends to be larger than in the conventional algorithm. However, the total
training time of the enhanced algorithm was reduced through the decrease in computation time per epoch.


3.  Recognizing  Characters  in  Vehicle  License  Plate 

Currently, vehicle license plates in Korea consist of eight characters, except for some special uses. The first two
characters in the upper row are Hangul (Korean script) characters (geographical region), and the third character in
the upper row is a numeral (vehicle class). The first character in the lower row is a Hangul character, while the next
four characters in the lower row are numerals, as in Figure-1. Part (a) of Figure-1 depicts a license plate in a
512*480 gray-scale vehicle image captured by a CCD camera, while part (b) depicts the digitized result of the license plate
after segmentation and preprocessing of each character. The resolution of the preprocessed character
strings, except for the four relatively large numerals in the lower row, is very low, which imposes complexities and
inherent difficulties on pattern recognition. In a review of 200 license plate images in the experiment it was
found that the size of Hangul characters in the upper row varies from 13*15 to 27*29 pixels, and the size of Hangul
characters in the lower row ranges from 16*23 to 38*47 pixels. The size of numerals varies too: numerals in
the upper row range from 8*17 to 20*24 pixels, while numerals in the lower row range from 10*49 to 29*60
pixels.


a)  License  Plate  in  Vehicle  Image  b)  Segmentation  and  Digitized  Result 


Figure-1: Korean Vehicle License Plate

The first two Hangul characters in the upper row indicate the geographical region, one of the six registration cities
and nine registration provinces. Five out of the six types of Hangul syllable[8] are included in the composition of the
two characters. The resolution of the two characters' image is so low that it is very difficult to separate vowels and
consonants, and to extract features via Bar Marking. When combining the two characters into a complete pattern after
separately recognizing each individual character was attempted, the uncertainty increased exponentially. Therefore, a
method of simultaneously recognizing the two characters as a single pattern was employed. The first Hangul character
in the lower row has the structure "consonant + vowel", generating 84 different character patterns. In recognizing this Hangul
character, a single character is divided into vowel and consonant. A structural method was employed for
vowel recognition, while a neural network was employed for consonant recognition. The vertical consonants of
Hangul characters are written in the simplified form, but horizontal consonants are written in the cursive style. It is therefore
advisable to divide consonants into horizontal consonants and vertical consonants, and to recognize each group separately.
Numerals comprise 10 different patterns from 0 to 9, and it is more effective to divide numerals into small-size
numerals and large-size numerals and to recognize each group separately. Figure-2 depicts the Hangul characters and numerals
which are used in Korean vehicle license plates.
In this research, five neural networks were constructed, one each for geographical
regions, small-size numerals, large-size numerals, vertical consonants, and horizontal consonants of Hangul
characters. Out of 300 license plates extracted from vehicles running on the roads, 200 license plates were used for
training and the other 100 license plates were used for testing the recognition rate. For the network
architecture, the number of input nodes was determined by the features, the number of hidden nodes by
experience, and the number of output nodes by the number of output codes.


4.  Experiments  and  Results 

In the experiments, training was continued until recognition of the training patterns was 100 percent correct, and then
the variation of the recognition rate with the removal of input nodes was investigated. Investigation of the geographical
region codes showed that 15 of the 34 features, when removed individually, did not have a strong impact on the
recognition rate, generally less than 1 percent. In particular, 5 features had no influence at all on the
recognition rate. However, when all 15 features were removed simultaneously, the recognition rate was
lowered to 59 percent, which revealed that the overall recognition rate was greatly influenced by the simultaneous
removal of individually irrelevant nodes. The recognition rate was easily recovered by adjusting weights with a small number of
iterations. With respect to feature selection, only a small number of training patterns is needed to effectively determine
the features of trainable patterns. Table-1 shows the architecture of the neural networks employed in the experiment, the number
of patterns, the number of features for each character, and the test results of character recognition for vehicle license
plates.


Table-1: Experimental Results of Character Recognition

                       Neural Net         Experimental Patterns   Feature Selection
                       Topology           ----------------------  -------------------------------
Characters             (input * hidden    Training    Test        Number of  Removable  Determined  Recognition
                       * output)          Patterns    Patterns    Features   Features   Features    Rate (%)
---------------------  -----------------  ----------  ----------  ---------  ---------  ----------  -----------
Geographical Region    29 * 29 * 15       200         100         34         15         -           96.0
Small Size Numeral     25 * 25 * 10       200         100         37         15         -           95.0
Vertical Consonant     14 * 14 * 14       152         78          16         6          14          94.9
Horizontal Consonant   14 * 14 * 14       48          22          16         -          14          90.9
Large Size Numeral     25 * 25 * 10       800         400         37         -          25          98.3

In Table-1, the number of removable features, i.e. those whose removal has no impact on the overall recognition rate,
was determined in the beginning step on the condition that even simultaneous removal of those
features would not have any impact on the recognition rate. In the case of numeral recognition, relatively many features
could be removed without a significant impact. However, in the case of Hangul characters and regional codes, because
of the low resolution, all the features showed relevance to the overall recognition rate.

In this research, the enhanced back-propagation algorithm was employed for the neural network application. The
algorithm was useful and effective for overcoming local minima, reached for example when the regional code "경기
(Kyunggi)" was learned as "경북 (Kyungbuk)" because of the similar character patterns, and for speeding up convergence.
The newly developed algorithm demonstrated much improved performance when applied to complex and
difficult patterns, but showed almost equal performance to conventional algorithms when applied to easy patterns. It
was found that the tolerance value used for determining error increase to detect oscillations has a strong effect on
learning speed. The degree of generalization was much enhanced and better results were demonstrated in the
experiment when the training patterns were uniformly distributed and the training set was large.


5.  Conclusion

In this paper, an enhanced back-propagation algorithm and a feature selection method for
improving convergence speed are described, and were employed in a character recognition system for license plates extracted
from moving vehicles. The proposed algorithm can enhance system performance through effective feature
extraction, reduced recognition and learning time, and an increased recognition rate. The vehicle license
plate recognition system, implemented in a T800 Transputer-based environment[10], required only
0.09-0.11 seconds for a complete recognition after digitization and segmentation, which was not problematic
in real-world applications. The recognition rate, even though it varied depending on the results of the preprocessing step,
was generally above 95 percent, a relatively high performance. This research requires further experiments on training
with more samples, and especially further research on new feature extraction methods and on improving the degree of
generalization.
ABSTRACT

A dual-use neural network technology, called the Statistical - Multiple Object Detection and Location
System (S-MODALS), has been developed by Booz·Allen & Hamilton, Inc. over a 5-year period, funded by various
Air Force organizations for Automatic Target Recognition (ATR). This conference paper will detail improvements
in the MODALS neural network architecture that led to the Statistical - MODALS architecture, which has a natural
extension to multi-sensor fusion (Visible, IR, SAR) and multi-look evidence accrual for tactical and strategic
reconnaissance. Since S-MODALS is a learning system, it is readily adaptable to object recognition problems other
than ATR, as evidenced by this S-MODALS investigation into the automated database query of DRUGFIRE forensic
imagery. The pattern matching problem of microscopic marks for DRUGFIRE shell casings is analogous to the
pattern matching problem of targets for the Visible component of the S-MODALS design. That is to say, the
physics; phenomenology; discrimination and search strategies; robustness requirements; and error level and confidence
level propagation are all of a similar nature.

1.0  Overview  of  the  MODALS  Approach 

The Multiple Object Detection and Location System (MODALS) distinguishes itself from many classical
Automatic Target Recognition (ATR) approaches because it simultaneously detects, segments, and identifies
multiple targets in an image [1-4]. Classical ATR approaches carry these operations out independently in a
sequential fashion. This sequentially independent approach can result in mistakes at each step, thus reducing the
overall system performance to the product of the performances of each step. MODALS consists of three layers, each
working together to bring about a target identification. The conceptual methodology behind MODALS is quite
simple: find features that distinguish the training targets from each other, associate the positions of these features for
each training target, and try to find the same set of target features in the same geometry in the test images.

Based on a training target, MODALS learns a set of distinct features. Several features, small squares of
pixels, are learned for each training target perspective. These features are used to estimate whether a target occurs at a
given location. However, features must be successfully acquired that are effective for both discriminating one target
from another and discriminating the targets from background. This discrimination capability is accomplished in
MODALS using a neural network learning technique. After learning the features, each feature type and feature
location is associated with the desired target type to produce a final neural network ATR performance system,
MODALS.

The MODALS performance network consists of three layers: the Feature Extraction Layer, the Spatial
Location Layer, and the Object Detection Layer. The Feature Extraction Layer computes matches between each
feature and all areas of the test image using a similarity metric. The Feature Extraction Layer's similarity metric is
based on minimizing the error between the learned feature and the test image. The similarity metric is invariant to
changes in ambient illumination and robust to changes in directional illumination, rotation, scaling, and noise. The
Spatial Location Layer locates the best feature matches in locations close to where features were learned in
training. These locations are defined by Spatial Location Layer masks. The size and shape of each mask is a
function of the desired amount of robustness to rotation and scale. The Spatial Location Layer produces both the
position and value of these feature matches for all possible target locations. The last layer of the MODALS system,
the Object Detection Layer, is responsible for target detection and identification. Feature and position information
corresponding to each training target perspective is combined to accumulate detection evidence for a training target
perspective. In parallel to the combination of feature errors, the Object Detection Layer also computes whether the
positions of the best feature matches correspond to a valid rigid body rotation. This computation involves the
calculation of a Positional Mean Square Error (PMSE), a measure of how much the locations of the detected features
correspond to a rigid body rotation of the training target. The combination of feature errors and positional errors
results in an error measure for every training target perspective at every possible target location. This error measure
can be interpreted as the overall error between the image at that location and each training target perspective. The
Object Detection Layer evidence for each training target perspective is spatially competed, resulting in one or more
spatially separated potential target detections. After the training target perspective competition, an Object Detection
threshold controls which target error measures are low enough to be considered detections. This threshold shifts the
ATR system operating characteristics along the classic probability of detection/false alarm ROC curve.

2.0  Probabilistic  Modification  of  MODALS 

The Feature Extraction Layer similarity metric is a measure of each feature's normalized mean square error
(mse). To obtain the training target perspective output response, MODALS simply averages the normalized mean
square errors across the features in a training perspective. This evidence accrual method produces the Object
Detection Layer output. Different objects consist of different numbers of features, and averaging in error space was
the method chosen to measure overall target error in a consistent way. However, this evidence accrual method has
two drawbacks. First, it does not take into account the different statistics of different features. For instance, an mse
value of 0.1 for a complicated feature may be highly indicative of a target, whereas the same mse value may be a
very common response to background for a relatively uniform feature. Second, averaging a few poor mse values
with many good mse values can have an unwarranted effect. These few outliers can bias the Object Detection output
to the point where it equals the average of many mediocre mse values caused by the background. One of the most
evident results of this combination anomaly was the initial, relatively poor performance of MODALS on occluded
targets versus unoccluded targets [1]. An evidence accrual method was needed which would compensate for the
differences in the statistics of different features, and would not allow a few poor mse values (caused by occluded
features) to spoil the Object Detection output.

Ideally, the Feature Extraction output would represent the probabilities that each feature is present, and the
Object Detection output would represent the probability that the target is present. If this were the case then, given
certain assumptions, probability theory provides the following equation for combining the Feature Extraction
outputs:

$$P_O = 1 - \prod_{j=1}^{N} \left(1 - P_{F_j}\right)$$

$P_O$ is the probability that the target is present, $P_{F_j}$ is the probability that feature $j$ is present, and $j$ runs from 1 to $N$
features.

To interpret this equation, remember that $P_{F_j}$ is the probability that feature $j$ is present, so $(1 - P_{F_j})$ is the
probability that feature $j$ is absent. $\prod_j (1 - P_{F_j})$ is the probability that all features are absent (assuming independence).
So, $1 - \prod_j (1 - P_{F_j})$ is the probability that not all features are absent (i.e. at least one feature is present). Therefore, if we
can determine the probability that each feature is present, we have a method for determining if the target is present.
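
The combination rule above can be written in a few lines. This is a minimal sketch assuming independent features, as the text does; the function name `target_probability` is hypothetical.

```python
import math

def target_probability(feature_probs):
    """P_O = 1 - prod_j (1 - P_Fj): probability that at least one feature is present."""
    return 1.0 - math.prod(1.0 - p for p in feature_probs)
```

For two features that are each present with probability 0.5, the rule gives 1 - 0.5 * 0.5 = 0.75.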

Can we determine $P_{F_j}$ given the mse output of the Feature Extraction similarity metric ($f_j$)? In probability
theory, this is called a conditional probability and is given by the formula:

$$P(F_j \mid f_j) = \frac{P(f_j \mid F_j)\, P(F_j)}{P(f_j)}$$

where $P(F_j \mid f_j)$ is the probability that feature $j$ is present given an mse of $f_j$, $P(f_j \mid F_j)$ is the probability distribution
of $f_j$ if the feature is actually present, $P(F_j)$ is the probability that the feature is actually present regardless of the mse
value (i.e. the target density), and $P(f_j)$ is the probability distribution of $f_j$ regardless of whether the feature is present
or not.
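
As a small numerical illustration of this formula, the sketch below expands the denominator P(fj) with the law of total probability. That expansion, the function name `feature_posterior`, and the example numbers are assumptions added here for illustration and do not come from the S-MODALS design.

```python
def feature_posterior(p_f_given_present, p_f_given_absent, prior_present):
    """P(Fj | fj) by Bayes' rule, with
    P(fj) = P(fj | Fj) P(Fj) + P(fj | not Fj) (1 - P(Fj))."""
    evidence = (p_f_given_present * prior_present
                + p_f_given_absent * (1.0 - prior_present))
    return p_f_given_present * prior_present / evidence
```

With a hypothetical target density of 0.05, `feature_posterior(0.8, 0.1, 0.05)` is roughly 0.30: even an mse value that is eight times more likely when the feature is present leaves the posterior well below one half when targets are rare.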


All measured mse values can be used to approximate the distribution P(fj). However, only those mse
values measured in the presence of the feature can be used to approximate the distribution P(fj | Fj). In order to
accurately approximate the function P(fj | Fj), it is necessary to have a large number of examples of each feature for
training. This implies a large number of examples of training target perspectives. In its current form, MODALS
requires only a small number of training target perspectives. This characteristic is one of its advantages. It is both
undesirable and impractical to attempt to directly approximate the function P(fj | Fj) by increasing the number of
training images required. Something other than P(Fj | fj) must be used to compute the Object Detection output.

As mentioned above, P(fj) is a distribution which is fairly easy to approximate; however, it is a
probability distribution and not a raw probability. P(mse < fj), called the cumulative probability, is the integral of
P(fj) from zero to fj. P(mse < fj) represents the probability of observing an mse value less than fj (regardless of
whether the feature is present or not). Since the features are taken from the target and the features are rare in the
background, the mse and P(mse < fj) will be lower when the feature is present than when it is absent. The
S-MODALS evidence accrual method uses (as the Object Detection output) the probability of observing a set of mse
values which are less than the measured set of mse values:

$$o = \prod_{j} P(\mathrm{mse} < f_j)$$

Here, a high Object Detection output indicates that doing better than all the measured mse values is fairly common.
A low Object Detection output indicates that doing better than all the measured mse values is rather uncommon.
So, low Object Detection outputs indicate targets, while high Object Detection outputs indicate background. Note
that a P(mse < fj) close to one (e.g. from an occluded feature) will have little effect on the product, whereas a very
small P(mse < fj) will have an extreme effect.
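
One way to picture this accrual rule is with an empirical cumulative distribution built from previously measured mse values. The sketch below is an illustration under that assumption only; how S-MODALS actually estimates P(mse < fj) is not specified here, and the names `cumulative_probability` and `detection_output` are hypothetical.

```python
import math
from bisect import bisect_left

def cumulative_probability(sorted_mse_samples, f):
    """Empirical P(mse < f): fraction of observed mse values strictly below f."""
    return bisect_left(sorted_mse_samples, f) / len(sorted_mse_samples)

def detection_output(sample_sets, measured):
    """o = prod_j P(mse < f_j); low values indicate a target, high values background."""
    return math.prod(cumulative_probability(samples, f)
                     for samples, f in zip(sample_sets, measured))
```

Each entry of `sample_sets` is the sorted list of mse values observed for one feature, and `measured` holds the mse value fj measured for that feature at the location under test.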

The advantage of using P(mse < fj) instead of the mse value fj can be seen in the following example.
Consider two features F1 and F2, both taken from the same target perspective. F1 is not very distinctive (e.g. almost
entirely uniform), and F2 is very distinctive (e.g. a corner or edge). Clearly an mse value of 0.1 for F1 is much less
indicative of a target than an mse value of 0.1 for F2. The probability of getting an mse value less than 0.1 for F1
might be 90% whereas the probability of getting an mse value less than 0.1 for F2 might be 5%. Moreover, the
probability of both mse values being less than 0.1 is 0.05*0.90=4.5%. Note that combining the mse values directly
would have meant assigning equal confidence to the presence of both features, when in fact there is much less
confidence in the presence of F1 than F2.

Until now in this discussion, we have been assuming that the Spatial Location Layer mask is only one
pixel (i.e. there is no difference between the Spatial Location output and the Feature Extraction output). We will
now remove this assumption. In the original formulation of MODALS, the Spatial Location output was simply the
minimum Feature Extraction mse over some elliptical region. If we continue to do this, a method is needed for
appropriately combining Spatial Location values. We cannot combine Spatial Location values directly, but we can
combine the probability of getting lower than a Spatial Location value the same way we combined the probability of
getting lower than a Feature Extraction value.

$$o = \prod_{j} P(\min \mathrm{mse} < s_j)$$

How is P(min mse < sj) related to P(mse < fj) if sj is the minimum of fj over some region? P(min mse < sj) is
the probability that the minimum mse is less than the observed minimum mse (i.e. sj). This would be true if one
or more mses in the region were less than sj. Put another way, this would be true if not all mses in the region are
more than sj. The probability that any one mse in the region is greater than sj is

$$1 - P(\mathrm{mse} < s_j)$$

The probability that all mses in the region are more than sj is

$$\left(1 - P(\mathrm{mse} < s_j)\right)^{N_j}$$

where Nj is the number of pixels in the region for feature j. The probability that not all mses in the region are
greater than sj is

$$P(\min \mathrm{mse} < s_j) = 1 - \left(1 - P(\mathrm{mse} < s_j)\right)^{N_j}$$

so the formula for the Object Detection output becomes (note that this degenerates to our earlier formula when Nj
equals one):

$$o = \prod_{j} \left(1 - \left(1 - P(\mathrm{mse} < s_j)\right)^{N_j}\right)$$
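
The region-based formula can be checked numerically with the short sketch below. It treats the Nj mse values in a region as independent, which is the assumption implicit in raising the single-pixel probability to the power Nj; the function names are hypothetical.

```python
import math

def min_mse_probability(p_single, n_pixels):
    """P(min mse < s_j) = 1 - (1 - P(mse < s_j)) ** N_j for a region of N_j pixels."""
    return 1.0 - (1.0 - p_single) ** n_pixels

def region_detection_output(p_singles, region_sizes):
    """o = prod_j (1 - (1 - P(mse < s_j)) ** N_j) over all features j."""
    return math.prod(min_mse_probability(p, n)
                     for p, n in zip(p_singles, region_sizes))
```

With `n_pixels` equal to one, `min_mse_probability` returns the single-pixel probability, so the output reduces to the product of the P(mse < fj) values used before the mask was enlarged.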


The above formula represents the theoretical basis for the Statistical - MODALS evidence accrual technique.
This evidence accrual method compensates for the differences in the statistics of different features, and does not allow
a few poor mse values (caused by occluded features) to spoil the Object Detection output. Figure 1 shows a set of
representative S-MODALS test imagery and Figure 2 shows the improvement of S-MODALS versus MODALS on
occluded targets. The same evidence accrual technique can also combine evidence from different sensors, different
representations, and over time. This formula provides S-MODALS with the ability to perform multi-sensor and
multi-look evidence accrual for tactical reconnaissance.

3.0  DRUGFIRE

Recognizing the benefits which could be derived from the application of state-of-the-art computer
technologies to the discipline of forensic firearms identification, the FBI has developed the DRUGFIRE system.
DRUGFIRE is presently a database-driven multimedia imaging system which significantly increases the
effectiveness of forensic laboratories in maintaining and searching either their own individual or shared multi-agency
unsolved case firearms evidence files. In the latter case, this new technology supports the establishment of regional
computerized firearms evidence clearinghouse operations which facilitate the sharing and linking of forensic
information between regionally clustered forensic laboratories. In so doing, it materially extends the capabilities of
forensic firearms identification examiners.

Since the introduction of the ballistic comparison microscope in 1925, which allows side-by-side
examination of the microscopic marks on two bullets or cartridge cases, firearms examinations have been limited to
the simultaneous comparison of two specimens on the same microscope. DRUGFIRE seamlessly integrates a
relational database, video, digital image processing and manipulation, audio and telecommunication technologies in a
manner which emulates and augments the functions of the ballistic comparison microscope. In the old microscopic
technique, two cartridge cases are presented in the same field of view under the microscope. An optical hairline
divides the two cartridge case images from each other. When similar microscopic markings are found on both
cartridge cases, the analyst can overlay the two images in an attempt to create a continuous image of the particular
marks being analyzed. The overlay is accomplished by manipulating the optical hairline, cartridge case holders and
lights of the comparison microscopes. The DRUGFIRE system digitally emulates this manual process on the
workstation monitor. Through the use of software, digital and video cartridge case imagery can be manipulated in
the same fashion as having the two physical shell casings under the microscope. The software permits the scaling,
rotating, and translating of the imagery, as well as edge enhancement and contrast/brightness control. The digital
images of the microscopic marks on representative fired cartridge cases and shotshell casings are stored in a relational
database and linked to the appropriate alphanumeric encodings for the descriptive forensic firearms identification
characteristics such as caliber type, firing pin impression type, and breech/bolt face mark type. The images, which
depict the highly reproducible microscopic features that cannot be effectively classified by alphanumeric encodings,
are captured in accordance with standardized DRUGFIRE system formats and protocols and are annotated so as to
indicate their orientation and the presence of distinctive features. A database query based on the descriptive forensic
firearms identification characteristics returns the cartridge case images in a tile (5 images x 5 images) format for the
analyst to visually inspect and select the most similar images for more comprehensive side-by-side comparison.

It has been recognized that the effectiveness of the DRUGFIRE system could be substantially improved
through additional automation of the querying process. The querying of the database and the searching of a large
number of images is still a rather labor-intensive process in the present system. This automated enhancement will
greatly increase the effectiveness of searching the microscopic marks in the DRUGFIRE system, particularly when
stored image files become voluminous and difficult to search using only the descriptive forensic firearms
identification characteristics of the current DRUGFIRE system. For the DRUGFIRE application, the pattern
recognition capabilities of S-MODALS directly exploit the highly reproducible microscopic marks that, unlike the
quantitative descriptive forensic firearms identification characteristics, cannot be effectively classified by
alphanumeric encodings.


4.0  DRUGFIRE  Image  Database  Query  With  S-MODALS 

In conjunction with the FBI, Booz·Allen's Advanced Computational Technologies Practice investigated the
application of S-MODALS to the imagery data from the DRUGFIRE Imagery Database. The objective of the
investigation was to closely approximate the hit performance level of a firearms examiner using an automated pattern
recognition system. S-MODALS was trained on the image of the primer area of only one fired cartridge case from a
particular criminal case. S-MODALS then processed all of the other candidate cartridge case images of the same
firearm caliber type retrieved from the DRUGFIRE Imagery Database. The S-MODALS database query produces a
prioritized list of cartridge case image matches for the forensic expert from best match to worst match. The
investigation included testing 9 cartridge case images covering four different firearm caliber types (.40 S&W, 10mm
Auto, .32 Auto, and .380 Auto), where each firearm caliber type had a different number of test cartridge case images
(45, 26, 30, 176). In each test the top match returned by S-MODALS was the training image. The next match
returned by S-MODALS was an actual DRUGFIRE hit. For the .380 Auto fired cartridge case image, S-MODALS
returned the two known hits as the top two candidates among 176 cartridge case images. In each instance where there
was more than one cartridge case to match with the training image, S-MODALS reduced the imagery search space
down to its smallest achievable amount, except in one incident. The exception was that S-MODALS rejected one of
the actual DRUGFIRE hits because the cartridge case image's magnification was much smaller than its labeling in
the database. In this incident S-MODALS actually performed quality control on the imagery.

Based on the promising potential expressed by this investigation, the FBI is presently "blind" testing the
S-MODALS technology under a three-phase test. Each phase includes searching a database of 200 9mm Luger
cartridge case images for hits with four unknown cartridge case images. Since twenty-five cartridge case images
make up one tile, the test consists of a potentially common operational condition where a firearms examiner must
search 8 tiles of imagery. Upon recently completing phase one, the performance of the S-MODALS pattern
recognition capabilities has ranked all of the DRUGFIRE hits within the top five matches except for one instance
when a cartridge case was placed on the second tile as the 38th match overall. Figure 3 shows the top four
S-MODALS candidates for one of the unknown cartridge case tests. The top three candidates are related DRUGFIRE
hits. The fourth match is the next best match.
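The tile arithmetic in this test can be checked with a short sketch. It assumes only what the text states (25 images per tile, matches presented in rank order); the function names are illustrative.

```python
import math

def tiles_needed(n_images, tile_size=25):
    """Number of tiles an examiner must page through for n_images."""
    return math.ceil(n_images / tile_size)

def tile_of_rank(rank, tile_size=25):
    """1-based tile on which the match with the given 1-based rank appears."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    return math.ceil(rank / tile_size)
```

A 200-image database therefore fills 8 tiles, a top-five match sits on the first tile, and the 38th match falls on the second tile, in agreement with the figures quoted above.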

5.0 Conclusion

A neural network technology, called the Statistical - Multiple Object Detection and Location System, was
developed for multi-sensor fusion (Visible, IR, SAR) and multi-look evidence accrual for tactical and strategic
reconnaissance. Since S-MODALS is a learning system, it is readily adaptable to object recognition problems other
than ATR, as evidenced by this S-MODALS investigation into the automated database query of DRUGFIRE forensic
imagery. The pattern matching problem of microscopic marks for DRUGFIRE shell casings is analogous to the
pattern matching problem of targets for the Visible component of the S-MODALS design. Other on-going
investigations include applying S-MODALS to face recognition and medical imagery for the Air Force.

Abstract

Medial axis transform (MAT) based features and a two-layer feedforward neural network were used in
this study for human chromosome classification. Two approaches to the MAT, the "skeleton" and the
piecewise linear (PWL), were examined. The medial axis based on the "skeleton" approach, as well as
the chromosome classification results based on this approach, were slightly better than those of the PWL
approach. Several chromosome features, like the density profile, the centrometric index and the length of
the chromosome, as well as combinations of them, were tested. The probability of correct training set
classification using all the available features and the neural network classifier was almost perfect
(99.3-99.6%). The probability of correct test set classification was greater than 97% using features based on the
"PWL" approach and over 98% using features based on the "skeleton" approach.

1.  Introduction 

Human chromosome inspection is a vital task in cytogenetics, especially in clinical prenatal analysis,
genetic syndrome diagnosis (e.g., Down's syndrome), cancer pathology research and environmentally
induced mutagen dosimetry [7], [10]. Cells used for chromosome inspection are taken mostly from
amniotic fluid or blood samples. One of the inspection aims is to detect deviations from normal cell
structure. Abnormal cells can have an excess or deficit of a chromosome and/or structural defects like
breaks, fragments or translocations (exchange of genetic material between chromosomes). However, even
today this inspection is performed manually in most cytogenetic laboratories in a time-consuming,
repetitive and expensive procedure [9], [10].

Efforts to develop automatic chromosome classification techniques have been made over the last
40 years. However, all the efforts to make chromosome analysis automatic have had limited success and
poor classification results compared to those of a trained cytotechnician [2], [7], [9], [10]. Some of the
reasons for the poor performance are the inadequate use of expert knowledge and experience and the
insufficient ability to make comparisons and/or eliminations among chromosomes within the same
metaphase. In addition, the systems always require operator interaction to separate touching and/or
overlapping chromosomes and to verify the classification results [7], [10].

Neural networks make it possible to overcome most of these limitations. This is mainly because they
permit application of expert knowledge and experience through network training. Furthermore, human
chromosome classification based on neural networks requires no a priori assumptions or knowledge of
the data to be classified, as some conventional methods need. Finally, it is well known that the problems
best solved by neural networks are those that humans do well, and classification of chromosomes is one
of them.


# This work was supported in part by the Paul Ivanier Center for Robotics and Production Management, Ben-Gurion
University, Beer-Sheva, Israel.

2. Feature description


Appropriate feature description is considered to be one of the most important parts of classification
procedures, and in human chromosome classification it is probably the most important one. In some
studies, global features, like the histogram of gray levels [3] or the 2D Fourier transform components [4],
have been used. In this study, we have employed 3 types of features: the density profile (d.p) along the
medial axis [1], [5], [7], the centrometric index (c.i) (the ratio of the short arm length to the whole
chromosome length) [2], [5], [7] and the length (lng) of the chromosome [5], [7]. The Medial Axis
Transform (MAT) is almost always required for the extraction of these features.

2.1  The  MAT 

The MAT is widely used as a convenient transformation for elongated objects, e.g., in character
recognition or chromosome analysis, where the width of the objects contains little (if any) useful
information. The MAT of an object can not only reduce storage and time requirements, but also
preserve the topological properties of the object.

Two different approaches to the MAT were used in this work, namely, the "skeleton" and the PWL
approaches [5]. The "skeleton" approach is based on finding a preliminary medial axis of the chromosome
via the realization of the fire front's propagation and extinction [11]. This preliminary medial axis is
further processed to get one extended continuous medial axis. Removing irrelevant points of the
preliminary medial axis on one hand and completion of necessary points on the other hand complete the
postprocessing of the medial axis in this approach. The second approach employs a piecewise linear
(PWL) approximation [2], [5] to the medial axis. The PWL is preferred over the use of existing
polynomial approximation techniques whenever a chromosome is not straight [2].

2.2  Feature  extraction 

The MAT in both approaches enables us to transform the 2D image of the chromosome to a 1D
representation. By calculating lines perpendicular to the medial axis points we can integrate (or average)
the intensities (gray levels) of all the image pixels along these lines and obtain a density profile (d.p).
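The density profile computation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the medial axis is taken as an ordered list of (row, column) points, the perpendicular direction is estimated from finite differences along the axis, pixels are picked by nearest-neighbor rounding, and the `half_width` parameter (how far each perpendicular line extends) is hypothetical.

```python
import numpy as np

def density_profile(image, axis_points, half_width=5, average=True):
    """Integrate or average gray levels along lines perpendicular to the medial axis."""
    pts = np.asarray(axis_points, dtype=float)      # (K, 2) ordered (row, col) points
    tangents = np.gradient(pts, axis=0)             # finite-difference tangent per point
    norms = np.linalg.norm(tangents, axis=1, keepdims=True)
    tangents = tangents / np.where(norms == 0, 1.0, norms)
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)
    offsets = np.arange(-half_width, half_width + 1)
    profile = np.empty(len(pts))
    for k, (point, normal) in enumerate(zip(pts, normals)):
        samples = point + offsets[:, None] * normal  # points on the perpendicular line
        rows = np.clip(np.rint(samples[:, 0]).astype(int), 0, image.shape[0] - 1)
        cols = np.clip(np.rint(samples[:, 1]).astype(int), 0, image.shape[1] - 1)
        values = image[rows, cols]
        profile[k] = values.mean() if average else values.sum()
    return profile
```

Setting `average=False` gives the integral representation and `average=True` the average representation of the profile.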

The method we have used in this study to calculate the centrometric index (c.i) is based on searching
for the closest pair of opposite contour points on the clipped contours of a chromosome [5], similarly to
the method described in [2]. However, instead of using an exhaustive search for the closest pair we
searched for the closest pair along the lines perpendicular to the medial axis. No fundamental difference
in results of the two methods is expected. However, our method is faster than the method in [2] (there is
no exhaustive search of all the pairs of opposite points). The length of the chromosome was calculated
along the medial axis.

All the features were further normalized. The d.p feature vector was normalized both in length and in
value. Normalizing in length yields a suitable feature representation (all classified vectors have the same
dimension) and invariance to scale change. The length of the normalized d.p vector was set to 64, based
both on chromosome length and on practical considerations. The 64 values of the d.p vector, the
centrometric index and the chromosome length were normalized into the [-0.5, 0.5] range, in agreement
with the MLP requirements.
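A minimal sketch of the two normalization steps for the density profile follows. The text does not say how the resampling or the value scaling was done, so linear interpolation and min-max scaling are assumptions made here for illustration.

```python
import numpy as np

def normalize_profile(profile, length=64, low=-0.5, high=0.5):
    """Resample a density profile to a fixed length and scale it into [low, high]."""
    profile = np.asarray(profile, dtype=float)
    old_grid = np.linspace(0.0, 1.0, len(profile))
    new_grid = np.linspace(0.0, 1.0, length)
    resampled = np.interp(new_grid, old_grid, profile)  # length normalization (assumed linear)
    span = resampled.max() - resampled.min()
    if span == 0:
        # A flat profile carries no shape information; map it to the middle of the range.
        return np.full(length, (low + high) / 2.0)
    return low + (high - low) * (resampled - resampled.min()) / span  # value normalization
```

Profiles of long and short chromosomes then share the same 64-element representation, which is what makes the feature invariant to scale change.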

3.  The  neural  network  classifier 

In this research, a two-layer feedforward neural network trained by the backpropagation (bp) learning
algorithm [8] was chosen for the chromosome classification. The bp algorithm is an error-driven
parameter estimation algorithm where the objective is to minimize the output squared error function by
adjusting interconnection weights and node thresholds. The network was initialized using random weights
in the [-1,1] range. The number of hidden units of the network was set according to the Principal
Components Analysis (PCA), applied to the feature vectors. The number was set to be the number of the
largest eigenvalues, the sum of which accounts for more than a pre-specified percentage of the sum of all
the eigenvalues [6]. We call this pre-specified percentage "var". In the implementations,
the "var" parameter was set to values of 70-90%.
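The hidden-unit rule can be sketched as below: count the largest eigenvalues until their running sum exceeds the fraction "var" of the total. Taking the eigenvalues from the sample covariance matrix of the feature vectors is an assumption (the text only says PCA is applied to the feature vectors), and the function names are illustrative.

```python
import numpy as np

def hidden_units_from_eigenvalues(eigenvalues, var):
    """Smallest number of largest eigenvalues whose sum exceeds var * (sum of all)."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratios = np.cumsum(vals) / vals.sum()
    above = ratios > var
    return int(np.argmax(above)) + 1 if above.any() else len(vals)

def hidden_units(features, var):
    """features: (n_vectors, n_features) array; var: fraction such as 0.7 or 0.9."""
    cov = np.cov(np.asarray(features, dtype=float), rowvar=False)
    return hidden_units_from_eigenvalues(np.linalg.eigvalsh(cov), var)
```

For eigenvalues 5, 3, 1, 1 the cumulative fractions are 0.5, 0.8, 0.9 and 1.0, so var = 0.7 gives two hidden units.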

4.  Data  set 

Images of amniotic fluid cells were acquired from the Institute of Medical Genetics of Soroka Medical
Center, Beer-Sheva. The pictures were obtained with the aid of a light microscope and captured by a
CCD camera (Cohu). The pictures were digitized with a frame grabber (VISIONplus-AT). The size of the
digitized picture was 512 x 768 pixels and each pixel was represented by 1 byte (256 gray levels). No
pre-processing techniques were applied. The segmentation was done manually using a graphical software
package on a 486 PC computer. Chromosomes of 5 different types, namely types "2", "4", "13", "19" and
"x", were extracted [3], [4] from more than 150 different cells.

For each chromosome the MAT was extracted and the 66 features (64 d.p + c.i + lng) were computed
using the procedure described in [5]. Several variations of features were tested to evaluate their
importance to the classification procedure, e.g., d.p alone, d.p + c.i, d.p + c.i + lng and c.i + lng. The d.p
features were extracted both using the integral representation and the average representation and in both
approaches: "skeleton" and PWL.

5.  Results 

The input vector to the neural network was either 2 or 64-66 dimensional (depending on the type of the
features). The output vector was 5 dimensional with one component set to "1" (actually 0.9) for the
correct classification and "0" (actually 0.1) elsewhere.
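The target encoding can be written out as a short sketch. The order of the classes in the output vector and the use of the largest output as the predicted class are assumptions for illustration; the 0.9 and 0.1 values are those given in the text.

```python
import numpy as np

# The five chromosome types used in this study; their order here is assumed.
CHROMOSOME_TYPES = ("2", "4", "13", "19", "x")

def target_vector(chromosome_type, high=0.9, low=0.1):
    """5-dimensional target: `high` at the correct class, `low` elsewhere."""
    target = np.full(len(CHROMOSOME_TYPES), low)
    target[CHROMOSOME_TYPES.index(chromosome_type)] = high
    return target

def predicted_type(output):
    """Assumed decision rule: the class whose output unit is largest."""
    return CHROMOSOME_TYPES[int(np.argmax(output))]
```

Targets of 0.9 and 0.1 rather than 1 and 0 keep the desired outputs inside the range a sigmoid unit can actually reach.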

Optimization of the neural network parameters regarding the chromosome data is described elsewhere
[6]. The learning rate (p) was set to be 0.026, the momentum constant (a) to be 0.97 and the training
cycle was set to be 4000 epochs, although only 500-1000 epochs were required to get almost the best
results. Training and test vectors were chosen randomly from the same data set where the number of
training vectors was 70-90% of all the vectors (depending on the experiment) and the remaining vectors
were reserved for testing. All the simulations were repeated (at least) 3 times, with the same network
parameters but with different sets of randomly chosen training vectors, and the results were averaged.
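The evaluation protocol (random split, repeated runs, averaged results) can be sketched as follows. The `evaluate` callback stands in for training and testing the network and is hypothetical, as are the function names and the fixed seed.

```python
import numpy as np

def random_split(n_vectors, per, rng):
    """Randomly assign a fraction `per` of the vectors to training, the rest to testing."""
    n_train = int(round(per * n_vectors))
    order = rng.permutation(n_vectors)
    return order[:n_train], order[n_train:]

def averaged_score(evaluate, n_vectors, per, repeats=3, seed=0):
    """Repeat the experiment on different random splits and average the scores.

    evaluate(train_idx, test_idx) is a placeholder that should train the
    network on train_idx and return its score on test_idx.
    """
    rng = np.random.default_rng(seed)
    scores = [evaluate(*random_split(n_vectors, per, rng)) for _ in range(repeats)]
    return float(np.mean(scores))
```

Here `per` plays the role of the "per" parameter (70-90% in the experiments), expressed as a fraction.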

5.1 The PWL vs. the "skeleton" approach to the MAT

Two major conclusions can be made [5] while comparing the PWL and the "skeleton" approaches.

Figure 1. A comparison of the (a) "skeleton" and (b) PWL approaches to the MAT.

The first is that the medial axis of the "skeleton" approach is finer than the axis of the PWL approach and
follows very accurately the chromosome band pattern (Figure 1). The second conclusion, which can be
drawn from Figure 2, is that while the probability of correct training set classification is similar in
both approaches, the probability of correct test set classification is larger using the "skeleton" features (by
about 2-5%). Both conclusions seem to be very closely related. Figure 2 depicts the classification results of
an experiment in which the percentage of training vectors ("per") is 70-90% of all the vectors and the
"var" parameter is set to be 70-90%.


[Figure 2 plot: results for the training and test sets, each for "var" = 70 and 90 and "per" = 70 and 90;
vertical axis scaled to 100.]

Figure 2. Classification based on the density profile features.

Figure 3. Chromosome clustering into a 2-dimensional feature space spanned by the centrometric index
(c.i) and the chromosome length (lng). ("o"- chromosome type "2", chromosome type "4",
chromosome type "13", chromosome type "19" and "x"- chromosome type "x").

5.2 Feature evaluation


The relative importance for the classification procedure of four sets of features was examined. The
first set includes only the density profile (d.p) features, while the second set includes, in addition, the
centrometric index (c.i). The length of the chromosomes (lng) is the additional feature in the third set.
The fourth set includes only the (c.i) and the (lng) features. To learn about the significance of these two
last features, we have plotted them against each other in Figure 3 for the entire data set. We
can see that these two important features are almost sufficient for the classification of the chromosome
data into its 5 types. However, these two features would not be enough when we try to classify the
chromosome data into all its 24 types.

The probability of correct classification of the neural network, using the 4 sets of features, for the
PWL approach, is given in Figure 4. The probability of correct classification in the training and in the
test sets using the first set of features (d.p) was 99.15-99.5% and 89.3-92.9%, respectively, for various
combinations of the two parameters "per" and "var". The probability of correct classification for the second set
of features (d.p + c.i) was 99.3-99.5% and 92.1-96.45% for training and test, respectively, and that of the
third set (d.p + c.i + lng) was 99.3-99.6% and 94.2-97.2% for training and test, respectively. The
probability, using only the (c.i) and the (lng) features, was 93.05-94.4% for training and 86.9-92.9% for
the test. These results indicate that the 2 features are almost as important as the 64 d.p features for
the classification of the 5 particular classes (types of chromosomes). This conclusion will definitely
change whenever all the 24 chromosome types are used. The probabilities achieved using the
"skeleton" approach were equal to or a little higher than those of the PWL. The figure clearly shows
the importance of combining different features, especially whenever "var" is relatively low
(a small amount of information is retained by the PCA).


Figure 4. The probability of correct test classification using the 4 sets of features and the PWL approach
to the MAT.


6.  Discussion  and  Conclusions 

Medial axis transform based features and a two-layer feedforward neural network were used in this
study for human chromosome classification. Two approaches to the MAT, the "skeleton" and the PWL,
were examined. The medial axis based on the "skeleton" approach, as well as the chromosome
classification results based on this approach, were slightly better than those of the PWL approach. Several
typical chromosome features, like the density profile, the centrometric index and the length of the
chromosome, as well as combinations of them, were tested. When classifying only 5 types of
chromosomes, as was done in this study, the relative importance of the centrometric index and of the
length of the chromosome is very high. The probability of correct training set classification using all the
available features and the neural network classifier was almost perfect (99.3-99.6%). The probability of
correct test set classification was greater than 97% using features based on the PWL approach and over
98% using features based on the "skeleton" approach.

ABSTRACT

We describe the use of neural networks to perform context dependent thresholding of grayscale
three-dimensional images of trabecular and cortical bone as measured in vitro by high-resolution x-ray computed
tomography ("micro-CT"). Classifiers are constructed on the basis of a simple model of the blurring
necessarily associated with the tomographic measurement. We discuss the procedure used for training and testing
and illustrate the application to actual experimental data.

1  Introduction 

The ability to measure with high (better than 100 micrometer) resolution the three-dimensional structure
of small specimens of human and animal bone has the potential for substantially increasing understanding
of many aspects of growth, remodeling, and mechanics of bone [1-3]. Typical cross-sectional slices extracted
from full data sets are shown in Fig. 1. For purposes of the present discussion, it is sufficient to regard
the measuring system as generating an estimate of density at each point of a three-dimensional lattice
superimposed on the object of interest. In the present case, the lattice spacing is 50 μm and a typical
cross sectional dimension of a bone specimen is somewhat less than 1 cm (i.e., 10000 μm). The density
estimates, which were reconstructed from two-dimensional images, are approximately the convolution of the
actual density of the object with a spatially isotropic resolution function. The full width at half maximum
(FWHM) of the system as considered here is approximately 59 μm.

Many structures of interest in trabecular bone, which is often described as consisting of plates and rods,
are of order 100-200 μm, i.e., not much larger than the resolution of the measuring system. The accessible
structures in cortical (dense) bone are the marrow cavities, which can have dimensions smaller than 100
μm. Though both trabecular and cortical structures are typically clearly visible in images such as Fig. 1,
quantitative analysis such as by use of stereological techniques requires each sample point to be labelled
as foreground (bone) or background (non-bone). A simple threshold, either that used in Fig. 1 or any
other, is easily seen to be inadequate. The underlying cause is the interaction of measurement resolution
with structures of different sizes and degrees of curvature. For equal intrinsic bone mineral densities, the
measured density (or gray level) that distinguishes bone from non-bone will be higher in the vicinity of a flat
or concave bone surface than near a convex surface. For example, the image gray level in the center of an
easily visible cortical cavity is as high as that of much of the bone in the trabecular region. Our approach is
thus to base our estimate of a particular lattice point's classification not only on its own measured density
value but also on its spatial context, i.e., on the densities of its close neighbors.

2  Construction  of  Input  Vectors 

Input  vectors  are  obtained  by  moving  a  volumetric  window  through  the  data.  On  the  basis  of  preliminary 
experiments,  we  have  chosen  a  33-element  input  vector,  consisting  of  the  central  point  and  the  32  neighbors 
located  within  2  lattice  spacings.  This  choice  is  a  compromise  between  retaining  as  much  context  information 
as  possible  and  minimizing  the  number  of  network  parameters  by  restricting  the  number  of  input  variables. 
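The 33-element window can be illustrated with a short sketch. It assumes that "within 2 lattice spacings" means a Euclidean distance of at most 2 on a cubic lattice, which yields exactly the central point plus 32 neighbors (6 + 12 + 8 + 6); the function names and the element order are illustrative.

```python
import itertools
import numpy as np

def neighborhood_offsets(radius=2):
    """Integer lattice offsets (dz, dy, dx) within `radius` spacings of the center."""
    steps = range(-radius, radius + 1)
    return np.array([o for o in itertools.product(steps, repeat=3)
                     if sum(c * c for c in o) <= radius * radius])

def input_vector(volume, center, offsets):
    """Densities at the central point and its neighbors, in offset order.

    The center must lie at least `radius` lattice points away from every face.
    """
    idx = np.asarray(center) + offsets
    return volume[idx[:, 0], idx[:, 1], idx[:, 2]]
```

Moving `center` over the interior lattice points of the measured volume produces one input vector per point to be classified.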
Figure 1: Tomographically reconstructed slices from three different human iliac crest core biopsy specimens
are shown in the top panels. A fixed-threshold binary version of each slice is also shown. Within each
specimen cortical bone appears at the top and bottom, while the open structure of trabecular bone occupies
the remainder of the volume. The plastic in which the specimens were embedded is visible, as is debris
resulting from the process of obtaining the specimens. The scale of size is the same for the horizontal and
vertical directions.

3  Generation  of  Training  and  Testing  Sets 

It was not feasible to obtain experimentally a large, representative selection of correctly labelled
examples, since available higher resolution imaging methods, e.g., light microscopy, are largely limited to two
dimensions. (We have, however, used direct microscopic examination as a standard in evaluating
analyses based on data thresholded by earlier methods [3].) As an alternative, we have generated input-output
training pairs from mathematical "phantoms" constructed to possess curvatures and structural thicknesses
that are representative for this application. Such a phantom, shown in Fig. 2, consists of a set of concentric
spherical shells of unit density. Spherical symmetry was chosen to avoid embedding unwarranted anisotropy
into the classifier. Both locally convex and locally concave foreground/background interfaces are present.
This is important, since trabecular surfaces are frequently convex, while the bone surface as viewed from
cortical cavities is predominantly concave.

The phantom is convolved with an isotropic Gaussian resolution function (FWHM 69 μm). An input vector is
generated by sampling the blurred phantom at a selected central position and at 32 neighboring samples on a 50
micrometer lattice. The corresponding target output value is the binary value of the unblurred phantom at
the central point. Both training and testing instances are generated by selecting randomly placed instances.
Our initial trial utilized uniformly distributed training instances. We observed, however, that most errors
in testing came from the regions very close to the foreground-background interfaces and disproportionately
represented regions of low radius of curvature. Hence, we adopted an alternative procedure in which we
preferentially select training instances from regions that straddle the interface radii. This has the effect
of excluding from training the approximately 40 percent of possible instances that are centered on largely
uniform regions. Further, we give more weight to instances near surfaces of high curvature by selecting the
regions that straddle each interface with equal probability. In the results reported here, we used training
sets of 20000 instances. For testing we employed 531441 instances selected uniformly from a similar phantom
whose interface radii interleave those of Fig. 2.

Figure 2: The center slices of the phantom from which training vectors are obtained. The binary phantom
at left is subjected to Gaussian resolution to produce the blurred phantom on the right. The thickness of
the shells represents 100 μm.

4  Networks 

4.1 Binary Tree Network

Our implementation of a binary-tree network, based on ideas presented in [4], was described briefly in [5].
Using a conjugate-gradient algorithm, each node of the network generates for the collection of instances
presented to it the Fisher linear discriminant direction in the space of the input vector. The instances
are then projected onto this direction. A splitting point along this direction is determined on the basis of
minimum entropy [4] or minimum number of misclassifications. If the node branches, the entropy criterion is
employed; if the node is terminal, the minimum errors criterion is used. One way for the node to be terminal
is for the split to be pure (i.e., for the instances presented to the node in training to be linearly separable),
in which case the two criteria yield the same split point. If the split is not pure, the node may still be made
terminal if it yields insufficient entropy reduction or if fewer than a specified number of instances would be
propagated to each of its prospective branches. Finally, a branch of a nonterminal node may itself be made
terminal if too few instances are assigned to it or if the entropy of the ensemble of instances is smaller than
a specified value.
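
The split-point selection described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names `entropy` and `best_splits`, the placement of candidate thresholds at midpoints between adjacent projections, and the convention that instances above the threshold are assigned to class 1 are all assumptions. The projections onto the Fisher direction are taken as given.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a group holding pos/neg class counts."""
    n = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_splits(projections, labels):
    """Return (entropy_threshold, error_threshold) along one direction.

    projections: scalar projection of each instance onto the node direction
    labels:      0/1 class of each instance
    Returns (None, None) if no threshold separates any two instances.
    """
    pairs = sorted(zip(projections, labels))
    n = len(pairs)
    total_pos = sum(lab for _, lab in pairs)
    total_neg = n - total_pos
    best_h = None  # (weighted entropy, threshold)
    best_e = None  # (number of errors, threshold)
    left_pos = left_neg = 0
    for k in range(n - 1):
        if pairs[k][1]:
            left_pos += 1
        else:
            left_neg += 1
        if pairs[k][0] == pairs[k + 1][0]:
            continue  # cannot place a threshold between equal values
        thr = 0.5 * (pairs[k][0] + pairs[k + 1][0])
        right_pos = total_pos - left_pos
        right_neg = total_neg - left_neg
        # Entropy of the two groups, weighted by group size.
        h = ((left_pos + left_neg) * entropy(left_pos, left_neg)
             + (right_pos + right_neg) * entropy(right_pos, right_neg)) / n
        # Left group is called class 0, right group class 1.
        errs = left_pos + right_neg
        if best_h is None or h < best_h[0]:
            best_h = (h, thr)
        if best_e is None or errs < best_e[0]:
            best_e = (errs, thr)
    if best_h is None:
        return None, None
    return best_h[1], best_e[1]


# Linearly separable toy data: both criteria pick the same split point.
h_thr, e_thr = best_splits([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
```

For the separable toy data both thresholds coincide, which mirrors the remark that a pure split makes the two criteria agree.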

Such termination criteria are an attempt to optimize the generalization capability of the resulting network
by not overfitting the training data. Determining the optimal growth of binary tree classifiers has been the
subject of considerable discussion [6], and we recognize that a priori criteria are likely to be less than optimal.
In the present case, however, we have plentiful examples beyond those used directly for training, permitting
us to perform a simple backward pruning to minimize the misclassification rate. (In order that a node made
terminal in the pruning process can be given the split point that minimized errors on the original training
set, we retain during training both the bias weight that corresponds to the split for minimum entropy and
the bias weight for minimum number of errors.)

A potential advantage of tree networks is that the first few nodes tend to separate off the obvious cases,
saving time in training, testing, and application. A potential disadvantage is that the decision boundaries are
constructed from hyperplanes. If a curved decision boundary is required for best performance, a binary tree
may run out of training instances before enough nodes are grown to allow the boundary to be approximated
adequately.

4.2  Layered  Networks 

We trained conventional layered networks of several architectures, ranging from a single nonlinear node to
networks with two hidden layers. Training was performed with both standard backpropagation (SBP) and
a method, called NDEKF, based on a node-decoupled extended Kalman filter [7]. The training parameters
for NDEKF were the default parameters described in Reference [7]. Twenty cycles through the training set
were used for NDEKF, while 200 cycles were employed for SBP.

5 Results of Training and Testing

The  binary  tree  generated  from  the  training  set  had  44  nodes,  and  produced  a  misclassification  rate  of  0.98% 
on  the  set  of  531441  from  which  the  20000  training  instances  were  selected.  Based  on  the  larger  set,  an 
optimal  pruning  was  carried  out,  reducing  the  tree  to  34  nodes  and  the  misclassification  rate  to  0.94%. 
When applied to the 531441 independent instances of the testing phantom, the error rate was 2.2%.

Taken together, the nodes beyond the root node make a small but important contribution to classification
accuracy, at a cost of somewhat in excess of a factor of two in the average computation time required to
classify an instance; if the tree is restricted to the root node, the misclassification rates are 1.4% and 2.5%
on the original and testing phantoms, respectively.

Using the NDEKF approach, a single nonlinear node yields an error rate of 0.11% on the original phantom
and  0.18%  on  the  testing  phantom.  Standard  backpropagation  did  not  do  quite  this  well,  but  the  error  rates 
at  0.29%  on  the  original  phantom  and  0.31%  on  the  testing  phantom  are  still  very  good.  (It  must  be 
mentioned  that  many  trials  were  required  before  we  found  SBP  parameters  that  led  to  effective  training.) 

We also employed SBP and NDEKF to train single-hidden-layer networks containing 2, 3, and 4 hidden
nodes and various networks with two hidden layers. Though several performed very well, none of these larger
networks proved more effective at generalization than the single-node NDEKF-trained classifier.

The  weights  of  both  the  first  node  of  the  tree  network  and  the  single  node  of  the  network  trained  by 
NDEKF  display  considerable  cubic  symmetry,  reflecting  the  symmetry  of  the  phantom  which  is  used  to 
generate  the  training  sets.  This  suggests  that  a  symmetrized  version  of  the  weight  vector  of  the  single-node 
network  would  be  effective.  Carrying  out  the  symmetrization  by  averaging  appropriate  weights  resulted  in 
approximately  the  same  performance  on  the  testing  set  as  noted  above.  By  averaging  input-vector  elements 
related by symmetry, the input vector can be reduced to 5 elements. We did this and retrained single-node
classifiers using both SBP and NDEKF. Training proceeds more rapidly and the misclassification rates in
testing  are  virtually  identical  to  those  reported  above. 
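
The reduction to 5 elements is consistent with a 33-point window made of the 3 x 3 x 3 cube around the central sample plus the six second neighbors along the lattice axes, which falls into exactly five classes under cubic symmetry. That window geometry is an assumption (the text states only that 32 neighbors are used), and the function names below are hypothetical. The sketch groups the offsets by symmetry class and averages the samples within each class.

```python
from itertools import product


def window_offsets():
    """Assumed 33-point window: 3x3x3 cube plus six axial second neighbors."""
    offsets = list(product((-1, 0, 1), repeat=3))
    for axis in range(3):
        for step in (-2, 2):
            off = [0, 0, 0]
            off[axis] = step
            offsets.append(tuple(off))
    return offsets


def symmetry_class(offset):
    """Offsets related by cubic symmetry share the same sorted |coordinates|."""
    return tuple(sorted(abs(x) for x in offset))


def reduce_by_symmetry(values, offsets):
    """Average the input-vector elements within each symmetry class."""
    sums, counts = {}, {}
    for val, off in zip(values, offsets):
        key = symmetry_class(off)
        sums[key] = sums.get(key, 0.0) + val
        counts[key] = counts.get(key, 0) + 1
    return [sums[k] / counts[k] for k in sorted(sums)]


offsets = window_offsets()
class_sizes = {}
for off in offsets:
    key = symmetry_class(off)
    class_sizes[key] = class_sizes.get(key, 0) + 1
reduced = reduce_by_symmetry([1.0] * len(offsets), offsets)
```

Under this assumed geometry the classes are the center (1 point), face neighbors (6), edge neighbors (12), corner neighbors (8), and axial second neighbors (6).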

In  order  to  assure  that  the  outstanding  performance  of  the  NDEKF-trained  single-node  classifier  is  not 
overly  sensitive  to  the  resolution  used  in  creating  the  training  and  testing  instances,  we  repeated  the  testing 
with instances generated with resolutions smaller (54 µm) and larger (64 µm) than the original. The error
rates were very good, 0.34% and 0.21%, respectively, indicating a considerable degree of robustness. In a
further  test,  the  foreground  and  background  of  the  original  phantom  were  reversed.  The  error  rate  for  this 
was  0.11%. 

6  Application  to  Experimental  Data 

From  the  standpoint  of  neural  networks,  the  relevant  generalization  has  already  been  discussed.  However, 
from  a  practical  standpoint,  the  important  consideration  is  performance  on  experimentally  obtained  data. 
This performance, in turn, is influenced by several factors that are not addressed by the evaluation of network
performance just presented. Primary among these is the degree to which the simple model used in creating
the phantom reflects the true transfer function of the measurement process. In particular, our model as
presented does not include statistical noise, which is certainly present experimentally. Further, the phantom
used to generate examples for training and testing does not model either spatial variations in the intrinsic
mineral density or variations in the density of the material in non-bone regions.

Figure 3: The slices of Fig. 1 classified by the binary-tree network (top) and by the NDEKF-trained single-
node network (bottom). For reference, images of the phantom in binary and blurred forms are shown in
proper scale in the bottom-right corner.

With these cautions in mind, the trained networks can be applied to experimental data. The density
(more properly, linear attenuation coefficient) range of the data must be scaled to the range of densities of the
blurred phantom. This is accomplished by creating density histograms for the entire 3D experimental data
set and the 3D phantom, and then matching central measures of their respective foreground and background
peaks.  This  permits  the  same  trained  network  to  be  applied  to  a  succession  of  data  sets  for  which  the  relation 
between  mineral  density  and  measured  attenuation  coefficient  varies,  as  might  be  caused  by  slightly  different 
x-ray  primary  energies. 
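
One simple way to realize this scaling is a linear map that sends the two peak positions of the data histogram onto those of the phantom histogram. The sketch below assumes such a linear map and takes the peak positions as given; how the central measures of the peaks are computed is not specified here, and the function name and the default phantom peak values (0 for background, 1 for foreground, following the unit-density phantom) are assumptions.

```python
def make_density_rescaler(data_bg, data_fg, phantom_bg=0.0, phantom_fg=1.0):
    """Build a linear map from measured attenuation to phantom density.

    data_bg, data_fg:       background / foreground peak positions in the data
    phantom_bg, phantom_fg: corresponding peak positions in the blurred phantom
    """
    if data_fg == data_bg:
        raise ValueError("foreground and background peaks must differ")
    scale = (phantom_fg - phantom_bg) / (data_fg - data_bg)

    def rescale(x):
        # The background peak maps to phantom_bg, the foreground peak to phantom_fg.
        return phantom_bg + scale * (x - data_bg)

    return rescale


# Hypothetical peak positions for one data set.
rescale = make_density_rescaler(data_bg=2.0, data_fg=6.0)
midpoint = rescale(4.0)
```

Because only the two peak positions enter the map, a new data set with a different attenuation scale needs only its own pair of peaks, and the trained network itself is left unchanged.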

In Fig. 3 we display the results of applying the optimally pruned binary tree and the NDEKF-trained
single-node classifier to the data of Fig. 1. It should be noted that, because of the volumetric window, the
network classifiers make use of more information than appears in the gray-scale images shown. Though the
two classifiers yield slightly different results, each is successful in preserving cortical cavity structure without
unduly thinning the delicate trabecular structure.

7  Discussion 

The weight pattern of the single-node classifier has a particularly simple structure. The central value is large
and positive, the nearest neighbors are much smaller and positive, and the remainder negative. The sum
of the elements, together with the bias, determines the threshold for a uniform region and is found to be
perfectly reasonable, i.e., right in the middle of the density range.

An obvious extension to the binary tree procedure, which should be useful for many other applications,
is to use NDEKF as an alternative means of determining the weights for each node of the binary tree. A
possible scheme, which we are investigating, would be to generate both the Fisher direction and the LMS
direction by the NDEKF procedure. The split point would then be chosen, as mentioned above, on the basis
of either minimum entropy or minimum errors, with the better direction used in either case. This elaboration
does not appear to be required in the present application, since a single node, determined by NDEKF, leaves
little to be desired. However, extension to cases with anisotropic resolution or other complications would
likely require more than a single node.

8  Conclusions 

We  have  described  the  application  of  neural  networks  to  the  problem  of  classifying  the  points  of  resolution- 
degraded  3D  data  sets  in  terms  of  an  underlying  binary  structure.  The  method  used  is  based  on  simple 
models  for  the  resolution  and  the  physical  structures  of  interest.  The  networks  employed  perform  well  on 
test  instances  generated  in  the  same  way  as  those  used  for  training.  Application  to  actual  data  is  also  quite 
satisfactory, and this method is currently being used as the first step in an analysis procedure.

Though  not  described  here,  the  model  with  which  examples  are  generated  has  been  extended  to  include 
noise. This is found to be useful when the classifier must be applied to data in which the noise level is higher
than in the data shown here.

Abstract

A new iterative algorithm is proposed to compute the energy function coefficients of a Hopfield Network. The
coefficients specify the relative weights of the energy function components. The components, typically,
correspond to an objective function and some constraints when solving optimization problems. The energy function
components compete and cooperate with each other as the network settles down to a stable state. The Hopfield
Network's Liapunov function is mapped to the energy function to derive the connection weights and the bias inputs
for the neurons. The descending capability of the Liapunov function is shared by the energy function components
based on the relative weights of their coefficients as the Hopfield Network evolves towards a final solution state with a
minimum energy. Attention must be paid as to how these energy function coefficients are found since they directly
affect the validity and the quality of the solution into which the network converges. Determining the right set of
coefficients is nontrivial when analytical methods are employed. The iterative algorithm treats the energy function
components as errors and adjusts the coefficients iteratively to minimize the errors. The validity of the algorithm is
verified with a dynamic time warping Hopfield Network which can be used in the pattern matching phase of a
pattern recognition system.

1.  Introduction 

The Hopfield Network is a fully connected, recurrent neural network with symmetric connection weights [6]. It has found
applications in various fields such as optimization [7], [25], pattern recognition [12], [25] and signal processing [11].
Moreover, the similarities between the Hopfield Networks and glasses attract physicists to describe the disordered
media [14]. The existence of a Liapunov function for the Hopfield Network, which is analogous to the energy function,
permits the use of statistical physics methods to determine the stability points [27]. It is possible to construct a Hopfield
Network with predefined dynamically stable configurations and consequently the network can be used as an associative or
content addressable memory [13]. The Hopfield Networks provide a basic model for the cognitive and computational
neuroscience fields [27]. Furthermore, it is feasible to implement the Hopfield Network in silicon with the present analog
VLSI technology [20].

This paper describes a new iterative algorithm to compute the energy function coefficients that specify the
relative weights of the energy function components which could be an objective function and some associated
constraints when solving an optimization problem using a Hopfield Network. Once the energy function is defined,
it can be mapped to the Liapunov function of the Hopfield Network to determine the connection weights and the
bias input for the neurons. Finding the proper set of coefficients is critical since these coefficients directly affect the
connection weights and the bias inputs which in turn determine the validity and the quality of the solution into
which the Hopfield Network converges. Thus, attention must be paid as to how these coefficients are determined. In
most of the studies reported, these energy coefficients are found empirically [1], [4], [7], [21]. Recently, some researchers
proposed that the eigenvalues and the eigenvectors of the connection weight matrix can be exploited to find the energy
coefficients when the nonlinear system equations for the Hopfield network are approximated by linear functions [3].
Others suggested that useful relationships among the energy coefficients can be obtained and the ranges for the energy
coefficients can be found by analyzing the stability of the dynamical fixed points [9]. The proposed algorithm provides a
systematic means to compute the energy function coefficients directly. There is no need to approximate the network
nonlinearities nor analyze the stability of the dynamical fixed points to apply the algorithm. The new algorithm is tested
with a dynamic time warping (DTW) Hopfield Network which was reported previously [22], [23]. The DTW is an
optimization algorithm which compares patterns to find an optimal match under some constraints. It is used in pattern
recognition applications such as speech recognition, speaker identification and speaker verification [16], [17], [19].
Solving optimization problems using a Hopfield Network is explained in the next section. The new iterative algorithm to
find the energy function coefficients is presented in Section 3. Section 4 describes the DTW using Hopfield Network. The
experimental results verifying the validity of the approach are given in Section 5. Finally, the conclusions are drawn in
Section 6.

2. Solving Optimization Problems with a Hopfield Network

Table 1 summarizes a procedure which can be used to set up a Hopfield Network to solve an optimization
problem:


Step 1. Find a neural network representation for the problem
Step 2. Determine a number representation with the neurons
Step 3. Define a Liapunov function L(v) for the Hopfield Network
Step 4. Devise an energy function E(v) for the optimization problem
Step 5. Derive the connection weights W and the bias inputs b by equating L of Step 3 and E of Step 4
Step 6. Compute the energy function coefficients c

Table 1: A General Procedure to Solve an Optimization Problem with a Hopfield Network.

In step 1, a neural representation scheme is found. It is necessary to assign a meaning to every neuron or group of
neurons as to what these neuron outputs depict when the neural network converges to a final state. Then, a number
representation scheme with the neurons is determined in step 2, since most problems require their solutions to be
in numerical form [21]. Step 3 requires the definition of a Liapunov function. The majority of the studies utilize
the following quadratic function which was proposed by Hopfield [6]:

$L(v) = -\frac{1}{2} v' W v - b' v + \sum_i \int_0^{v_i} g^{-1}(a)\, da$

where the integral is from 0 to $v_i$ and i ranges from 0 to N-1. The outputs of the neurons are represented
collectively by the vector v, the connection weights between the neurons by the matrix W, the bias inputs by the
vector b and the activation functions of the neurons by g.

In step 4, an energy function consisting of an objective
function, possibly with some constraints, is defined. This function is minimized to obtain the best solution under
the constraints. The characteristics of this function should match those of the Liapunov function of the Hopfield
network since there will be a mapping between the two to determine the connection weights and the bias inputs for
the neurons. Typically a quadratic function would be suitable but any other class of functions could also be used as
long as a corresponding Liapunov function is found for the network. The constraints can be added to the objective
function to make the mapping easier between these two functions. It has been shown that inequality, as well as
equality, constrained optimization problems can be effectively solved by means of the Hopfield Network [2]. In
general, the energy function can be devised in the form

$E(c, v) = \frac{1}{2} c_0 E_0(v) + \frac{1}{2} c_1 E_1(v) + \dots + \frac{1}{2} c_k E_k(v)$

where the $E_0$ component corresponds to the objective function and the remaining components $E_1$ through $E_k$
represent the constraints. There is no straightforward method to find the energy function for a given problem.
Each problem requires a different approach and the energy function for a particular problem is not unique.

The Liapunov function L(v) and the energy function E(v) are equated to each other in step 5, and the connection
weights W and the bias inputs b for the neurons are found by comparing the linear and the quadratic parts. During
this derivation, the integration component of the Liapunov function is ignored since its contribution is negligible
because of the high gain of the activation function [6]. Also, the constant term in the energy function E(v) is
ignored since it does not have any effect on the result while minimization is taking place. Note that the connection
weights and the bias inputs which are found by equating the energy function to the Liapunov function assure the
minimization of the objective function along with the constraints by forcing the neuron outputs to follow a
monotonically decreasing energy path as the network evolves. Consequently, when the network reaches a
minimum energy state, a solution to the optimization problem is achieved. Step 6 is explained in the following
section.
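
The monotone descent referred to above can be checked numerically on a toy network. The sketch below assumes the high-gain limit, so the integral term of L(v) is dropped and each neuron is a binary threshold unit updated asynchronously; it also assumes W is symmetric with a zero diagonal. The weights, biases, and function names are hypothetical and are not taken from the paper.

```python
def liapunov(W, b, v):
    """Quadratic and linear parts of L(v) = -1/2 v'Wv - b'v."""
    n = len(v)
    quad = sum(W[i][j] * v[i] * v[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * v[i] for i in range(n))
    return -0.5 * quad - lin


def settle(W, b, v0, max_sweeps=100):
    """Asynchronous threshold updates until no neuron changes.

    Returns the final state and the value of L(v) after every state change.
    """
    v = list(v0)
    n = len(v)
    energies = [liapunov(W, b, v)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = sum(W[i][j] * v[j] for j in range(n)) + b[i]
            new = 1 if net > 0 else 0
            if new != v[i]:
                v[i] = new
                changed = True
                energies.append(liapunov(W, b, v))
        if not changed:
            break
    return v, energies


# Hypothetical 3-neuron network: symmetric weights, zero diagonal.
W = [[0.0, -2.0, 1.0],
     [-2.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
b = [1.0, 1.0, -0.5]
final_state, energies = settle(W, b, [1, 1, 0])
```

Each single-neuron update changes L(v) by minus the change in that neuron's output times its net input, which is never positive under the threshold rule, so the recorded energies cannot increase.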

3.  Computation  of  the  Energy  Function  Coefficients 

The energy function coefficients $c_i$ weigh the objective function and the associated constraint components, and specify
their relative shares in the descending capability of the Liapunov function L(v) as the Hopfield Network evolves towards a
final solution state with a minimum energy. These energy components cooperate and compete with each other during the
iterations of the neural network. Since they control the energy function directly, they determine the quality of the solution
into which the network converges.

Figure 1 shows a schematic representation of the algorithm. The objective is to adjust the energy function coefficients
$c_i$ in such a way that the energy function components $E_i$ (that are treated as errors in this context) are pushed towards their
minima so that high quality solutions are achieved while maintaining the validity of the result.

The energy function components descend as the network converges to a stable state since the connection weights W
and the bias inputs b are computed by equating the energy function E(v) to the Liapunov function L(v) which
monotonically decreases during the iterations. This also guarantees the convergence of the algorithm that is confirmed by
the computer simulations performed in Section 5.

To be able to use this algorithm, for each constraint component $E_i$ the maximum and the minimum values ($E_i^{\max}$
and $E_i^{\min}$) have to be calculated. Then the error ranges can easily be found as shown in Table 2. Also, one
has to decide on the training set and determine what values to use for the parameters validity threshold Ni (the
number of times the neural network has to converge to a valid solution in a row), and the adjustment factor Δc.

The selection of the training set depends on the application field and the quality-validity tradeoff. The ideal training
set would be all possible combinations of the inputs. For some problems, this approach may not be practical due to the
abundance of input data. In such cases, the use of a representative subset can be sufficient, as we did for the DTW
applications in this work. The optimal selection of the training set can be quite complicated and is beyond the scope of this
study.


The initial values of the energy function coefficients are selected as $c_0 = 0$, $c_1 = 1$, $c_2 = 1$, $c_3 = 1$, $c_4 = 1$, $c_5 = 1$. This way,
the competing effect of the objective function with the constraints is eliminated at the beginning. Having one as the value
for the constraint coefficient gives equal effect to every energy function constraint component. During the iterations of
each run of the neural network, the constraint components are examined to see whether they reach their minima, and the
coefficient with the highest relative error is updated in accordance with the proposed algorithm. Once the network
achieves Ni valid results in a row (in the algorithm, "valid" is the counter used for this purpose), the objective
function coefficient is updated to push the network towards finding better quality results. With this new value of the
objective function coefficient the above process is repeated until the network reaches an invalid final state. Once this
occurs, the value of the objective function coefficient is kept constant and the constraint coefficients are updated as before
until the network achieves Ni valid solutions in a row again. When this validity threshold is exceeded the network
is ready for a higher objective function coefficient. This continues until no further improvement is possible for the
objective function coefficient. The improvement is measured by means of the objective function to constraint coefficient
ratios ($c_0 / c_i$) as defined in Table 2. In this study, if any of these ratios decreases then the algorithm terminates. Better
improvement checking mechanisms can be developed considering more than one decrease in a row (checking the
average of the ratios over a few runs rather than halting at the first decrease), the rate of change of the ratios and so forth.

The selection of the validity threshold Ni depends on the application. If the invalid solutions do not degrade the
performance of the system significantly, as reported in [22], [24], then this value can be chosen smaller. This results in a
larger objective function coefficient which in turn promotes higher quality solutions but more often invalid results. If the
quality of the solution is not the primary concern (having a valid solution is of higher priority) then the validity threshold
Ni can be set to a large value which suppresses the enlargement of the objective function coefficient. In our experiments,
the validity threshold Ni is picked as 10.

The adjustment factor Δc is another parameter of the algorithm which is used to
calibrate the energy function coefficients. A constant adjustment factor, namely Δc = 0.1, is used in this study. Analytical
methods can be developed to find more precise values which would elicit better tuning of the energy function coefficients.
It should be noted that this parameter does not have to be a constant. Faster and better results might be achieved by
employing an adaptive adjustment factor during the iterations.

4. Dynamic Time Warping Using a Hopfield Network

DTW is a pattern matching algorithm which is used to compare an input test pattern with a reference pattern template
and obtain an optimum match subject to certain constraints [8]. The associated distance between the two patterns is also
determined during the process. The DTW algorithm eliminates the nonlinear x-axis variations to compensate for the
nonlinear temporal distortions which might arise due to the variations in the speaking rates of the speakers in speech
processing applications. Consequently a better comparison is achieved as opposed to an ordinary direct template matching
procedure which might yield a larger distance between the two patterns despite the similarity. It is widely utilized in
pattern recognition areas such as speech recognition, speaker verification and recognition and contributes significantly to
the performances of these speech processing systems [16], [17], [19]. While effective in pattern recognition, the DTW
algorithm is lacking in that the processing time becomes a major consideration for real time applications as the number
and the size of the patterns increase. A parallel computing architecture becomes the only avenue to achieve the high
computational rate demanded. A possible remedy toward this end is the use of a Hopfield Network.

The DTW algorithm can be formulated as a minimum cost path problem as illustrated in Figure 2. This way the
problem is transformed to finding an optimal alignment path m = w(n) between a reference pattern r(n) and a test pattern
t(m) over a 2-D finite cartesian grid of size N x N, where N is the length of the patterns, and n and m are the discrete time
scale indices for the reference and the test patterns respectively. Each grid node v(n,m) has a specified cost d(r(n),t(m))
which corresponds to the distance between the reference pattern sample r(n) and the test pattern sample t(m). The problem
is to obtain the minimum cost path from v(0,0) to v(N-1,N-1). Note that the patterns r(n) and t(m) could be
multidimensional feature vectors representing the data to be compared.

In order to implement an effective and efficient DTW algorithm, it is necessary to specify a number of factors and
constraints on the solution which could vary depending on the application field [15]. These are typically endpoint
constraints, local and global path constraints, axis orientation and distance measure specification. The endpoint
constraints match the boundary points of the test and reference patterns (i.e., w(0) = 0, w(N-1) = N-1). The local path
constraints [8] allow only the arcs with slopes 0, 1 or 2, and avoid consecutive zero slope arcs. These constraints guarantee
that the average slope of the warping function w(n) lies between 1/2 and 2, provide path monotonicity and prevent
excessive compression or expansion of the time scales of the patterns as depicted in Figure 2. The endpoint and the local
path constraints give rise to the global path constraints. The global path constraints define the domain of the matching
operation which is a parallelogram (indicated by a dashed line) as shown in Figure 2. The axis orientation may affect the
performance of the system if the path constraints and/or the distance measure are asymmetric [18]. In this study the
reference pattern is mapped to the abscissa, and the test pattern is mapped to the ordinate as shown in Figure 2. The
absolute difference metric $d(r(n), t(m)) = |r(n) - t(w(n))|$ is used as the distance measure. So, the total distance along the
optimal path w(n) from the grid point (0,0) to the grid point (N-1,N-1) can be written

$D = \min_{w(n)} \left\{ \sum_n d(r(n), t(w(n))) \right\}$

where n runs from 0 to N-1. The type of the distance measure used by the DTW algorithm may affect the matching results
depending on the properties of the patterns compared [5]. A succinct review of the distance measures can be found in [10].

With all these constraints in mind, we can reiterate the definition of the DTW problem as finding an optimal warping
path m = w(n) through the grid points v(n,m) (in Figure 2) to match the reference pattern r(n) with the test pattern
t(m) subject to the constraints such that the total distance D is minimized. Thus, for the particular example
illustrated in Figure 2, the optimal warping path m = w(n) (indicated by the solid line) goes through the grid nodes
v(0,0), v(1,1), v(2,1), v(3,3), v(4,4) and v(5,5) and corresponds to the best match between the two patterns with
the associated total distance 10 (2+1+2+0+2+3). Note that none of the other valid paths (which satisfy the
constraints) within the parallelogram have smaller total distance.
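
As an illustration of the constrained search described above, the following minimal sketch finds the minimum-cost
warping path by conventional dynamic programming rather than with a Hopfield network. It assumes the absolute
difference metric, the endpoint constraints w(0) = 0 and w(N-1) = N-1, arc slopes restricted to 0, 1 or 2, and no
two consecutive zero slope arcs. The function name `dtw_itakura` and the sample pattern are illustrative and do not
come from the paper.

```python
def dtw_itakura(r, t):
    """Minimum total distance and warping path w(n) under the slope constraints.

    State (n, m, z): the path is at grid node (n, m); z = 1 if the arc that
    entered the node had slope 0 (so the next arc must not have slope 0).
    """
    N = len(r)
    INF = float("inf")
    best = [[[INF, INF] for _ in range(N)] for _ in range(N)]
    back = [[[None, None] for _ in range(N)] for _ in range(N)]
    best[0][0][0] = abs(r[0] - t[0])          # endpoint constraint w(0) = 0

    for n in range(1, N):
        for m in range(N):
            d = abs(r[n] - t[m])
            for slope in (0, 1, 2):
                pm = m - slope                 # row of the predecessor node
                if pm < 0:
                    continue
                for pz in (0, 1):
                    if slope == 0 and pz == 1:
                        continue               # no consecutive zero slope arcs
                    cand = best[n - 1][pm][pz] + d
                    z = 1 if slope == 0 else 0
                    if cand < best[n][m][z]:
                        best[n][m][z] = cand
                        back[n][m][z] = (pm, pz)

    # Endpoint constraint w(N-1) = N-1: pick the cheaper of the two end states.
    z = 0 if best[N - 1][N - 1][0] <= best[N - 1][N - 1][1] else 1
    total = best[N - 1][N - 1][z]
    m, path = N - 1, [N - 1]
    for n in range(N - 1, 0, -1):
        m, z = back[n][m][z]
        path.append(m)
    path.reverse()
    return total, path


r = [1, 5, 2, 8, 3, 7]
print(dtw_itakura(r, r))   # identical patterns align along the diagonal
```

Carrying the "last arc was flat" flag in the state is one simple way to enforce the ban on consecutive zero slope
arcs; other formulations of the same local constraint are possible.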

To be able to realize the DTW algorithm using the Hopfield Network, the procedure given in Table 1 is followed:
Every grid point on the (n,m) plane in Figure 2 can be naturally represented by a neuron. Thus, a two dimensional
array (of size N x N) representation is used for the neural network with a total number of N^2 neurons. The neuron
outputs will be denoted by v_{x,i} with subscripts x (for ordinate m) and i (for abscissa n) showing the row and the
column indices respectively. The optimal path m = w(n), which corresponds to the optimal match between the test and
the reference patterns, will be determined by the neurons which have outputs 1 when the network converges to a
stable state.

The second step of the procedure given in Table 1 can be omitted since there is no need to find a number
representation with the neuron groups for the implementation of the DTW algorithm. The neuron outputs of the
continuous Hopfield Network stay in the range 0 through 1 and the neuron outputs with binary states 0 and 1 are
sufficient to represent the warping path m = w(n). To ensure the validity of the path, the neurons are forced to
have binary values 0 or 1 by means of an appropriate constraint component in the DTW energy function. As a result
of this constraint, the neuron states converge to either 0 or 1 outputs when the Hopfield Network reaches a minimum
energy stable state which corresponds to one of the corners of the NxN dimensional hypercube.


Hence, by scrutinizing the warping path m = w(n) through the grid nodes in Figure 2, and considering the objective
function D (total distance along the optimal path w(n)), and the DTW constraints described, the following energy
function E(v) can be constructed for the DTW algorithm


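$$E(v) = \frac{c_0}{2}E_0 + \frac{c_1}{2}E_1 + \frac{c_2}{2}E_2 + \frac{c_3}{2}E_3 + \frac{c_4}{2}E_4 + \frac{c_5}{2}E_5$$

Here $E_0$ is a triple sum over x, i and y of distance-weighted products of $v_{x,i}$ with the outputs
$v_{y,i+1}$ and $v_{y,i-1}$ in the adjacent columns, and $E_4$ is a sum over x, i and j of products
$v_{x,i} v_{x,j}$ of outputs in the same row; the explicit index expressions of these two components are not legible
in this copy. The remaining components, with every sum running from 0 to N-1, are

$$E_1 = \sum_{x} \sum_{i} \sum_{\substack{y \ne x \\ y \ne x+1 \\ y \ne x+2}} v_{x,i}\, v_{y,i+1}, \qquad
E_2 = \sum_{x} \sum_{i} \sum_{y \ne x} v_{x,i}\, v_{y,i},$$

$$E_3 = \Big( \sum_{x} \sum_{i} v_{x,i} - N \Big)^2, \qquad
E_5 = - \sum_{x} \sum_{i} \big( 1 - 2 v_{x,i} \big)^2$$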

where modulo N is used for the subscripts wherever applicable (i.e., N ≡ 0). The detailed explanation of the energy
function and the derivations of the connection weights and the bias inputs can be found in [22].


The last step of the procedure given in Table 1 is the computation of the energy function coefficients c. By using
the energy function E(v), and the following equations (with N = 10), the number of the neurons at each column and
the total number of neurons - including the boundary neurons if any - within the boundary of the parallelogram (an
example for N = 6 is shown in Figure 2) can easily be calculated. The number of neurons at column n is
m_H(n) - m_L(n) + 1 and the total number of neurons is equal to $\sum_n [m_H(n) - m_L(n) + 1]$ where n = 0,...,N-1.
Hence, starting with column 0, the number of neurons at each column is 1, 3, 4, 6, 6, 6, 6, 4, 3, 1 which adds to
40. Now, in the succeeding paragraphs, we will analyze each component of the DTW energy function E(v), and calculate
the numerical values of the error ranges for the energy function constraints.
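
The per-column neuron counts quoted above can be reproduced with a short calculation. The closed-form bounds used
here for the lowest and highest admissible rows, m_L(n) and m_H(n), are an assumption: the paper's own equations are
not reproduced in this text, and these expressions (slope limits 1/2 and 2 drawn from both corners and rounded
outward so that boundary neurons are included) were chosen because they yield exactly the counts listed for N = 10.

```python
def column_bounds(n, N):
    """Assumed lowest and highest row index inside the parallelogram at column n."""
    m_low = max(n // 2, 2 * n - (N - 1))      # floor(n/2) and the slope-2 line into (N-1, N-1)
    m_high = min(2 * n, (n + N) // 2)         # slope-2 line from (0, 0) and ceil((n + N - 1) / 2)
    return m_low, m_high


def column_counts(N):
    """Number of neurons m_H(n) - m_L(n) + 1 in each column n = 0..N-1."""
    counts = []
    for n in range(N):
        m_low, m_high = column_bounds(n, N)
        counts.append(m_high - m_low + 1)
    return counts


counts = column_counts(10)
print(counts, sum(counts))
```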


The component E0 (weighted by c0/2) corresponds to the objective function that minimizes the total distance D
between the two patterns along the warping path w(n) through the grid points.


The component E1 (weighted by c1/2) stands for the Itakura path slope constraint. The slopes of the arcs between the
grid nodes are pushed to 0, 1, or 2 by this component. It takes its minimum value zero when all neurons have output
zero. The maximum value is reached if all neuron outputs are one. For each neuron at column i, there can be 7
disallowed arcs (with slopes other than zero, one or two) connecting it to neurons at the adjacent column i+1.
Therefore 7x10x10 = 700 is the maximum value of the function for the entire NxN grid. If we consider only the
parallelogram, then it is 0+4+12+20+21+20+12+4+0+0 = 93. Therefore the error range for this energy function
component is 93 - 0 = 93.

E2 (weighted by c2/2) forces every sample of the reference pattern to be visited once during matching with the test
pattern. It becomes minimum (zero) when all neuron outputs are zero. The maximum value is reached if all neuron
outputs are one. For each neuron at column i, there are 9 possible multiplications. Therefore 9x10x10 = 900 is the
maximum value of the function for the NxN grid. For the parallelogram, it is
1x0+3x2+4x3+6x5+6x5+6x5+6x5+4x3+3x2+1x0 = 156. So, the error range is 156 - 0 = 156.


Because of E3 (weighted by c3/2), the network ends up having N active neurons (output value one) when a stable state
is reached. E3 has the minimum value zero when exactly N neurons are active. The maximum value is attained when all
neuron outputs are one. Thus, (100-10)^2 = 8100 is the maximum the function can get for the complete NxN grid. For
the parallelogram, it is (40-10)^2 = 900. Consequently, the error range is 900 - 0 = 900.

Successive zero slope arcs in each row are avoided by the component E4 (weighted by c4/2). It reaches its minimum
value zero if all neuron outputs are equal to zero. The maximum value is acquired when all neuron outputs are equal
to one. For row x, there are 8 neuron couples (for which the outputs are to be multiplied). Hence the maximum value
of the function for the whole NxN grid can be 8x10x10 = 800, and for the parallelogram, beginning with row zero, it
is 0+2+12+12+12+12+12+12+2+0 = 76. The error range for this component is 76 - 0 = 76.

E5 (weighted by c5/2) helps the neurons to have 0 or 1 output when the network converges to a minimum energy state.
E5 reaches its minimum value -100 for the NxN grid when all neuron outputs are either zero or one. For the
parallelogram, it is -40 since there are 40 neurons inside (including the border) the parallelogram. The maximum
value is obtained if all neuron outputs are 0.5, which is the fuzziest state for the neurons with (1 - 2x0.5) = 0,
and the error range is 0 - (-40) = 40.
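
The five error ranges derived above (93, 156, 900, 76 and 40 for N = 10) can be checked by direct counting over the
parallelogram with every neuron output set to one. This is a sketch under stated assumptions: the parallelogram
bounds are the same assumed closed forms that reproduce the quoted column counts, and the E4 count takes the
penalized pairs to be neurons in the same row that are at least two columns apart, which is an inference from the
row sums quoted in the text rather than a form confirmed by the paper.

```python
def parallelogram(N):
    """Set of (row, column) nodes inside the search region (assumed bounds)."""
    nodes = set()
    for n in range(N):
        m_low = max(n // 2, 2 * n - (N - 1))
        m_high = min(2 * n, (n + N) // 2)
        nodes.update((m, n) for m in range(m_low, m_high + 1))
    return nodes


def error_ranges(N):
    """Maximum minus minimum of E1..E5 over the parallelogram."""
    nodes = parallelogram(N)
    # E1: arcs from column i to column i+1 whose slope is not 0, 1 or 2.
    e1 = sum(1 for (x, i) in nodes for (y, j) in nodes
             if j == i + 1 and (y - x) not in (0, 1, 2))
    # E2: ordered pairs of distinct neurons sharing a column.
    e2 = sum(1 for (x, i) in nodes for (y, j) in nodes if j == i and y != x)
    # E3: squared excess of active neurons over N when all are on.
    e3 = (len(nodes) - N) ** 2
    # E4 (assumed index set): ordered same-row pairs at least two columns apart.
    e4 = sum(1 for (x, i) in nodes for (y, j) in nodes
             if y == x and abs(j - i) >= 2)
    # E5: ranges from -(number of neurons) up to 0.
    e5 = len(nodes)
    return e1, e2, e3, e4, e5


print(error_ranges(10))
```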

5.  Experimental  Results 

The dynamical behavior of the Hopfield Network is represented by the differential equation du/dt = Wv + b. The
number of equations is equal to the number of the neurons in the network, and the operation of the neural network
is simulated by solving these differential equations simultaneously. The equations are solved numerically using
Euler's method [22], [23].
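
A minimal sketch of such a simulation is given below. It integrates du/dt = Wv + b with the forward Euler method
for a toy two-neuron network; the logistic output function v = 1/(1 + exp(-u)), the step size, the weights and the
initial state are all illustrative assumptions and are not the DTW network of this paper.

```python
import math


def sigmoid(u):
    """Assumed neuron output function mapping the internal state u into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def euler_hopfield(W, b, u0, dt=0.1, steps=300):
    """Integrate du/dt = W v + b, v = sigmoid(u), with forward Euler steps."""
    u = list(u0)
    for _ in range(steps):
        v = [sigmoid(x) for x in u]
        du = [sum(w * vj for w, vj in zip(row, v)) + bi for row, bi in zip(W, b)]
        u = [x + dt * d for x, d in zip(u, du)]
    return [sigmoid(x) for x in u]


# Two mutually inhibiting neurons: the one with the slightly larger initial
# state wins and its output approaches 1 while the other approaches 0.
W = [[0.0, -2.0], [-2.0, 0.0]]
b = [1.0, 1.0]
v = euler_hopfield(W, b, u0=[0.1, -0.1])
print(v)
```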

First, to elucidate the operation of the new iterative algorithm (given in Table 2), an illustrative experiment is
carried out with the adjustment factor Δc = 0.2, and the validity threshold Ni = 5. The reference pattern r and the
test pattern t1 shown in Figure 5 (a) are used as the training input. Figure 3 shows the progress of the energy
function components Ei, (i = 0,...,5) per run for 100 runs (50 iterations each). The energy function coefficients
converged to c0 = 0.8, c1 = 4.0, c2 = 4.8, c3 = 1.2, c4 = 1.4, c5 = 1.6 at the end of 100 runs. The initial values
used for the energy function coefficients were c0 = 0, c1 = 1, c2 = 1, c3 = 1, c4 = 1, c5 = 1 as suggested in the
algorithm. Note that the energy function components descend more smoothly and more consistently during the
iterations as the coefficients are adjusted at each run.

Next, to carry out the subsequent experiments, the algorithm is run with the adjustment factor Δc = 0.1 and the
validity threshold Ni = 10. The reference r and the test patterns t1 and t2 shown in Figure 5 (a) and (c) are
utilized as the training input. At the end of the training, the energy function coefficients are computed as
c0 = 2.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.5. The initial values of the coefficients were c0 = 0,
c1 = 1, c2 = 1, c3 = 1, c4 = 1, c5 = 1 as suggested in the algorithm.

To evaluate the performance of the network, uniformly distributed random reference and test signals are
generated. From these signals a distance matrix d is produced (absolute differences between the signal samples as
shown in Figure 2). The distances are normalized to the unit square. Using d, the optimal warping path
corresponding to the global minimum total distance and the path with the global maximum distance are
determined by going through all the possible paths within the parallelogram, as shown in Figure 2. Then the
DTW Hopfield Network is employed to find the optimal path. A distance measure is defined to compare the results
as d_GM = (min_NN - min_G) / (max_G - min_G) x 100, where min_G and max_G are the global minimum and maximum
distances corresponding to the best and worst warping paths and min_NN is the minimum distance corresponding to
the optimal path found by the network. d_GM is the percentage of the distance to the global minimum and represents
the independent variable on the horizontal axis in Figure 4 (a) and (b). The y-axis denotes the number of times
d_GM occurred out of 500 runs. Two tests are run to measure the performance of the DTW Hopfield Network with the
constraint coefficients c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3, c5 = 1.5. In the first test, the numerical value
of the objective function coefficient is taken as c0 = 2.0. Then the same test is repeated with a more aggressive
objective function coefficient, c0 = 4.0, to demonstrate its impact on the solution validity and quality. Figure 4
(a) shows the test results with the energy function coefficients c0 = 2.0, c1 = 13.8, c2 = 13.8, c3 = 4.5,
c4 = 6.3, c5 = 1.5. The network converged to a valid solution 96 % of the time and reached the global minimum 20
times. In our previous study (without using the new iterative algorithm) the results were 85 % and 6 times
respectively [23]. With the energy function coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5, c4 = 6.3,
c5 = 1.5, the result summarized in Figure 4 (b) is obtained. Using this set, the DTW Hopfield Network reached a
valid solution 12% of the time and converged to the global minimum 56 times, which were 63 % and 27 previously
[23]. In this case, the quality of the paths found is superior to the prior case as expected. The reason for this
is that while the constraint coefficients c1, c2, c3, c4, c5 enforce the validity of the warping path, the
objective function coefficient c0 competes with them to minimize the total distance associated with the path.
Thus, the quality of the DTW path can be improved by increasing the value of c0 but this results in more frequent
invalid paths. For both cases, the network converges to a valid solution in less than 50 iterations (peaking
around 20) and the results achieved show that the network is capable of matching the reference and test patterns
effectively.
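
The evaluation procedure can be sketched as follows: enumerate every warping path that satisfies the endpoint and
slope constraints, take the smallest and largest total distances as the global minimum and maximum, and express a
candidate path's distance as a percentage between them. The helper names `valid_paths`, `path_cost` and `d_gm` are
illustrative assumptions; the sample patterns are arbitrary and are not data from the paper.

```python
def valid_paths(N):
    """Yield every path w(0..N-1) with w(0)=0, w(N-1)=N-1, slopes in {0,1,2},
    and no two consecutive zero slope arcs."""
    def extend(path, last_zero):
        if len(path) == N:
            if path[-1] == N - 1:
                yield list(path)
            return
        for slope in (0, 1, 2):
            if slope == 0 and last_zero:
                continue
            m = path[-1] + slope
            if m <= N - 1:
                path.append(m)
                yield from extend(path, slope == 0)
                path.pop()
    yield from extend([0], False)


def path_cost(r, t, path):
    """Total absolute-difference distance along a warping path."""
    return sum(abs(r[n] - t[m]) for n, m in enumerate(path))


def d_gm(min_nn, min_g, max_g):
    """Distance to the global minimum, as a percentage of the min-max spread."""
    return 100.0 * (min_nn - min_g) / (max_g - min_g)


r = [1, 5, 2, 8, 3, 7]
t = [1, 1, 5, 2, 8, 7]
costs = [path_cost(r, t, p) for p in valid_paths(len(r))]
min_g, max_g = min(costs), max(costs)
print(min_g, max_g, d_gm(min_g, min_g, max_g))
```

Exhaustive enumeration is only practical for short patterns; it is used here because the text describes the
reference extremes as obtained by going through all possible paths.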

The purpose of the last experiment is to demonstrate the superiority of the pattern matching performed by the
DTW Hopfield Network over the ordinary direct template matching. First, the direct template matching is applied
to the reference pattern r and the test patterns t1, t2 which are shown in Figure 5 (a) and (c). The absolute
difference distance metric (|x|) is used to calculate all distances. The distance between r and t1 is found as 62
(20-14 + 15-3 + 5-1 + 4-0 + 11-4.9 + 14.9-13 + 20-14 + 19-15 + 16-5 + 7-0), and the distance between r and t2 is
found as 55 (20-15 + 15-15 + 12.5-5 + 10-0 + 10-4.9 + 14.9-10 + 20-10 + 15-7.5 + 5-5 + 5-0). Thus, according to
the direct template matching, the test pattern t2 is more similar to the reference pattern r than the test pattern
t1. Next, the DTW Hopfield Network (with the energy function coefficients c0 = 4.0, c1 = 13.8, c2 = 13.8, c3 = 4.5,
c4 = 6.3, c5 = 1.5) is used to find the distances between the same patterns. This time the distance between r and
t1 is found as 1.93 (in 14 iterations) and the distance between r and t2 is found as 3.77 (in 20 iterations).
Figure 5 (b) and (d) illustrate the effect of DTW clearly. As the results show, the DTW Hopfield Network can
compare patterns more intelligently and achieve better solutions than the ordinary direct template matching.
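
The direct template matching figures of 62 and 55 can be reproduced with a sample-by-sample sum of absolute
differences. The pattern values below are an assumption: Figure 5 is not available here, so r, t1 and t2 were
inferred from the term-by-term differences listed in the text (each term is written larger value first, and the
reference sample is the one common to both sums at each position).

```python
def template_distance(r, t):
    """Direct template matching: sum of absolute sample-by-sample differences."""
    return sum(abs(a - b) for a, b in zip(r, t))


# Sample values inferred from the listed differences (assumed, see above).
r = [20, 15, 5, 0, 4.9, 14.9, 20, 15, 5, 0]
t1 = [14, 3, 1, 4, 11, 13, 14, 19, 16, 7]
t2 = [15, 15, 12.5, 10, 10, 10, 10, 7.5, 5, 5]

print(round(template_distance(r, t1), 6), round(template_distance(r, t2), 6))
```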

6.  Conclusions 

The main objective of this study is to show that the proposed iterative algorithm can be used to compute better
energy function coefficients for a Hopfield Network. A DTW algorithm, which compares two patterns to obtain the
best match under some constraints, is used to verify the validity of the approach. The idea behind this algorithm
is to find the optimal balance among the energy function components to obtain a high quality result while
maintaining the validity of the solution. The algorithm has the flexibility to accommodate different quality
requirements of diverse optimization problems. The results provided in Section 5 verify that this algorithm finds
a good set of energy coefficients which yields a better pattern match than the ordinary direct template matching.
The same set of energy function coefficients was also used to measure the performance of the DTW Hopfield Network
relative to the traditional DTW algorithm. Using the DTW Hopfield Network along with the new iterative algorithm,
more satisfactory results are achieved in comparison to our previous study [23].

The procedure given in Section 2 provides a methodical approach to solve optimization problems using the Hopfield
Network. Most of the steps in this procedure are straightforward, except the neural network representation and the
definition of the energy function. There can be more than one valid neural network representation and energy
function for a given problem. The DTW energy function E(v), defined in Section 4, is neither unique nor claimed to
be the best energy function for the DTW problem. Combining some of the constraint components and/or incorporating
them into the objective function would reduce the number of energy function coefficients. But then it would not be
possible to control the effects of these components independently. It should be noted that the components of the
energy function compete and cooperate with each other, while the neural network descends with the Liapunov
function, as dictated by the energy function, toward a stable minimum energy state. The energy function
coefficients c0 through c5 define the characteristics of this falling motion. There is a delicate balance among
these components which are weighted by the energy function coefficients. It would be interesting to study the
effects of changing the energy function coefficients dynamically (as a function of energy) as the neural network
evolves toward a solution state. This could aid the DTW Hopfield Network to reach lower minima with faster
convergence rates.

The effect of the objective function (relative to the constraint components) could be reduced by recalibrating the
energy coefficients if maintaining a valid result has a higher priority than the quality of the solution. For the
signal recognition system described in [24], the quality of the path was the main concern (validity of the path
had secondary importance) since only uncorrelated signals pulled the network to the invalid state space. When the
signals were similar, the neural network remained in the valid state space.

Find error ranges ε_i for constraints;  i = 1,...,K      /* K: number of energy function constraints */
Initialize energy function coefficients:
    c_i ← 1;  i = 1,...,K      /* constraint coefficients */
    c_0 ← 0                    /* relax objective function coefficient */
Initialize improvement ratios;  i = 1,...,K
While (training is not sufficient)
    Apply an input to network from training set
    While ([...];  i = 1,...,K)
        valid ← 0              /* reset validity in a row counter */
        Find connection weights W and bias inputs b
        While (valid < Ni)     /* below validity threshold */
            Initialize neurons and run network
            if E_i = [...];  i = 1,...,K
            then               /* valid result */
                valid ← valid + 1
            else               /* invalid result */
            {
                valid ← 0
                Find max normalized error and its index j:  max_i (E_i / ε_i)
                Adjust c_j:  c_j ← c_j + Δc
                Find W and b
            }
        end While
        Adjust c_0:  c_0 ← c_0 + Δc
        Update improvement ratios;  i = 1,...,K
    end While
end While
Stop

Table 2: An Iterative Algorithm to Find the Energy Function Coefficients c_0,...,c_K
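
The core of the coefficient adjustment loop in Table 2 can be sketched in a few lines. This is a simplified sketch
under stated assumptions, not the authors' implementation: the network is replaced by a hypothetical callable
`run_network(c)` that returns the constraint errors for the current coefficients, a result counts as valid when all
of those errors are zero, both adjustments are additive steps of Δc, and the improvement-ratio stopping rule (not
legible in the table) is replaced by a fixed number of objective coefficient steps `c0_steps`.

```python
def tune_coefficients(run_network, error_ranges, delta_c, n_valid, c0_steps,
                      max_runs=10000):
    """Raise constraint coefficients until n_valid consecutive runs are valid,
    then raise the objective coefficient c[0]; repeat c0_steps times."""
    K = len(error_ranges)
    c = [0.0] + [1.0] * K            # c[0] relaxed, constraint coefficients start at 1
    runs = 0
    for _ in range(c0_steps):
        valid = 0
        while valid < n_valid:       # below validity threshold
            runs += 1
            if runs > max_runs:
                raise RuntimeError("no valid setting found within max_runs")
            errors = run_network(c)  # hypothetical: constraint errors E_1..E_K
            if all(e == 0 for e in errors):
                valid += 1
            else:
                valid = 0
                # index of the largest error normalized by its error range
                j = max(range(K), key=lambda i: errors[i] / error_ranges[i])
                c[j + 1] += delta_c
        c[0] += delta_c
    return c


def toy_network(c):
    """Stand-in for a network run: constraints fail once c[0] outweighs them."""
    return [max(0.0, 1.0 + 2.0 * c[0] - c[1]), max(0.0, 1.0 + c[0] - c[2])]


print(tune_coefficients(toy_network, [1.0, 1.0], delta_c=0.25, n_valid=3, c0_steps=2))
```

The toy stand-in only mimics the competition between the objective and the constraints; in the paper each run is a
full Hopfield Network simulation with recomputed W and b.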


Figure 1: A Schematic Representation of the Iterative Algorithm that Finds the Energy Function Coefficients


Figure 2: A DTW Example Depicting an Optimal Alignment Path m = w(n) to match r(n) to t(m)

Abstract

This paper explores a hierarchical arrangement of neural networks applied to pattern classification
problems. The structure consists of a switching network and a collection of leaf networks. The switching
network has the responsibility for selecting which of the leaf networks will ultimately perform the
classification. The number of leaf networks, as well as the classes assigned to each leaf network, is
determined by iterative application of a clustering algorithm. The iterations terminate when an objective
function is minimized. The advantage of this method is the modularization of the network structure
which dramatically reduces training time and allows locally confined network maintenance. For the
multi-font character recognition problem considered here, classification accuracy remained comparable to
that of a single back propagation network and training time was reduced by a factor of 20. Greater
speedups can be achieved if parallel training and clustering are used. A classification problem is also
presented for which the method exceeds the classification accuracy of a single back propagation network
while also reducing training time.


Introduction 


The usefulness of neural networks for classification problems is based upon a network's ability to construct
arbitrarily complex decision surfaces. This is frequently accomplished by training a single network to
separate all classes simultaneously. Thus, the training algorithm must find a single set of weights which
accurately classifies all samples. This is analogous to sorting a million names by moving all of them at
once and hoping that the resulting ordering is closer to a sorted list. People sort large lists by first
separating the elements into smaller groups, such as by the first letter in a name. The resulting sublists
are then sorted. A similar approach is taken with the network structure described here. The classes are
first grouped into clusters and a separate neural network, a "leaf" network, is associated with each cluster.
A "switching network" is responsible for selecting the appropriate leaf network to perform the final
classification task.

This approach has several advantages over the traditional monolithic training algorithm. First, the
resulting network is a collection of "plug-in" components. If a more efficient switching network can be
identified, that component can be removed from the tree and replaced with the new network without
disrupting the operation of the leaf networks. This structure does not require homogeneous topologies or
training algorithms, giving the designer flexibility to attack localized problems with appropriate solutions.
Similarly, if additional data from one class becomes available, the corresponding leaf network can be
removed, retrained and reinserted into the tree without affecting the remaining networks. A second
advantage is that the resulting networks are smaller and thus require less time to train. In addition, since
fewer separating surfaces must be identified, each network has a simpler problem to solve than a single
network. This too contributes to decreased training requirements. Third, since the networks operate
independently, they can be trained in parallel. A multiprocessor system or a collection of workstations
can be used to train the structure in approximately the amount of time required to train the largest
network in the structure.

The idea of creating a hierarchical structure of neural networks is not new. Tree structures have been
used to decompose problems as well as to increase reliability through combining decisions from multiple
branches of a classification tree [2]. Of particular interest are the CART algorithms, which generate
neural tree classifiers, and have been shown to be very effective [1][7][10][11][12][13]. Neural tree
classifiers combine classification trees with neural networks by utilizing relatively small neural networks
at the interior nodes of the tree to identify splitting rules. The CART (classification and regression tree)
method for constructing classification trees proceeds in two phases. First is the growing phase which
recursively finds splitting rules at interior nodes optimizing a criterion such as an impurity measure.
Traditionally, the splitting rules have been based on single features or linear combinations of features.
Neural networks have been used to generalize this method by finding nonlinear combinations of features
on which to base the splitting rules [5]. The second phase is a pruning phase in which a subtree is selected
based on minimization of an error-complexity criterion. Leaf nodes are associated with a single class.

Our approach is to associate leaf nodes not with single classes but with clusters of classes. A neural
network serves as a switching device to select the correct leaf network. A leaf network is then used to
perform the classification within the cluster. One advantage of this approach is a reduction in the height
of the tree and an accompanying reduction in classification time. More importantly, the switching
network and all of the leaf networks can be trained in parallel which greatly reduces training time.

Method 

The structure consists of four components: the switching network, the leaf networks, the clustering
algorithm, and the error recovery algorithm. Note that a single decision network may not be appropriate
for every problem. A hierarchy of switching networks was also implemented, and while effective, was not
necessary for the problems studied. However, other classification tasks may benefit from an additional
layer of switches.

The algorithm can be described as follows:
step 1: Cluster the classes.

step 2: Train a switching network to classify a vector as a member of a given cluster.
step 3: Train each leaf network to discriminate between the classes for which it is responsible.
Note that all of the leaf networks and the switching network can be trained simultaneously.
step 4: Present a testing vector to the switching network. The switching network will select a
leaf network. The leaf network will classify the vector or have an insufficient response to make a
classification. In that event, the switching network selects the next most likely candidate and
repeats step 4.

The first step, determining the number of leaf networks and the class distribution, can be accomplished by
a clustering algorithm. A maximum distance clustering algorithm was the most effective of the well
known algorithms we investigated [3][6]. However, we introduce a variation designed to produce a
relatively even distribution of class assignments. Three competing elements must be balanced in selecting
the proper network topology: the number of leaf networks, the difficulty of the classification task each
must perform, and the difficulty of selecting the correct leaf network. As the number of leaf networks
increases, the difficulty of the individual tasks decreases but that difficulty is simply transferred to the
switching network. Alternatively, if too few leaf networks are used, classification accuracy suffers and as
the network sizes increase so does the training time. The algorithm presented below seeks to balance
these competing interests.

The clustering algorithm used here can be summarized as follows:

step 1: Select an initial value, n, for the number of clusters. This is an artificial starting point
and will be adjusted by successive applications of the algorithm.

step 2: Select cluster seeds. The process of selecting seeds for the desired number of clusters is
described inductively. The first two seeds, s_1 and s_2, are chosen so that d(s_1,s_2) = ||s_1 - s_2|| is a
maximum. Suppose S_x = {s_1, s_2, ..., s_x} is the set of seeds chosen for the first x clusters. Select
s_{x+1} not in S_x such that [...] is a maximum and set S_{x+1} = S_x ∪ {s_{x+1}}. Repeat the process until
all n seeds have been selected.

step 3: Select a point p to add to a cluster. Each point not yet assigned to a cluster is considered
in turn. A point is temporarily added to cluster C_j and

$$\sum_{x \in C_j} \sum_{y \in C_j} d(x, y)$$

is computed. The point is assigned to the cluster for which the above sum is minimized. One
additional consideration must be explored. If a point p is a member of a class c, and p has been
assigned to cluster C_j, what does that imply about other elements of c? One possibility is to allow
individual points to be assigned without reference to prior assignments of other elements of c.
This implies that more than one leaf network may be responsible for classifying vectors in a
given class. This is only practical if a sufficient number of training samples is available for
each leaf network. Another possibility is to force all elements of a class to be grouped in the
same cluster. This can be done by assigning one element in a class using the technique
described above then assigning all other elements of that class to the same cluster. An alternative
method is to perform the above calculation using classes instead of individual points.
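
The seed selection of step 2 can be sketched as below. The criterion used to pick each additional seed is an
assumption: the expression in the text is missing, so this sketch uses the common max-min rule, in which the next
seed is the point whose distance to its nearest existing seed is largest. The function name `select_seeds` and the
sample points are illustrative only.

```python
import math


def select_seeds(points, n):
    """Pick n cluster seeds: the farthest-apart pair first, then (assumed rule)
    repeatedly the point that is farthest from its nearest existing seed."""
    pairs = [(math.dist(p, q), i, j)
             for i, p in enumerate(points)
             for j, q in enumerate(points) if i < j]
    _, i, j = max(pairs)                      # d(s1, s2) is a maximum
    seeds = [points[i], points[j]]
    while len(seeds) < n:
        candidates = [p for p in points if p not in seeds]
        seeds.append(max(candidates,
                         key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (20, 0)]
print(select_seeds(points, 3))
```

With the sample points, the two extremes are chosen first and the third seed falls near the middle, which is the
spreading behaviour a maximum distance scheme is meant to produce.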

Steps 1-3 can be computed in parallel for a variety of n values. To determine an optimal number of
clusters, compute the average cluster tightness T_n, where average cluster tightness is defined by [...]
where n is the number of clusters and [...] is the number of points in cluster C_j. Let M be the total number
of points. As n increases from 1 to M, T_n decreases, first rapidly, then more slowly until T_M = 0.
Judgment is used to select that value for which ΔT_n, the change in T_n, is sufficiently small.


In these experiments all networks were trained with back propagation, but it is not necessary to do so or
even to have all networks trained using the same algorithm. One of the advantages of the "plug-in"
components is that any type of network can be inserted at any point in the tree.


An error recovery algorithm is necessary to handle situations in which the wrong leaf network was
selected by the switching network. In many situations, if a network is not able to classify a vector, it will
produce a small output at each of the output nodes. The current solution is to set a threshold value and if
none of the outputs reaches the threshold value, the current leaf network is declared to be inappropriate
and an alternate selection is made. The leaf network which received the second largest output value from
the switching network is selected and the process is repeated.
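
The recovery procedure can be sketched as below. The function and parameter names are hypothetical, the threshold value of 0.5 is an assumption, and "the process is repeated" is read here as walking down the switching network's ranking until some leaf network exceeds the threshold.

```python
def classify_with_recovery(x, switch, leaves, threshold=0.5):
    """switch(x) returns one score per leaf; leaves[i](x) returns class outputs.

    Returns (leaf index, class index, attempts), or (None, None, attempts)
    when no leaf network reaches the threshold.
    """
    scores = switch(x)
    # Try leaf networks in decreasing order of switching-network output.
    ranking = sorted(range(len(leaves)), key=lambda i: scores[i], reverse=True)
    for attempt, leaf in enumerate(ranking, start=1):
        outputs = leaves[leaf](x)
        winner = max(range(len(outputs)), key=outputs.__getitem__)
        if outputs[winner] >= threshold:
            return leaf, winner, attempt
    return None, None, len(ranking)

# Stub networks standing in for trained ones (illustrative values only).
switch = lambda x: [0.9, 0.6, 0.1]
leaves = [lambda x: [0.1, 0.2], lambda x: [0.05, 0.8], lambda x: [0.3, 0.3]]
result = classify_with_recovery([0.0], switch, leaves)
```

In this toy run the first leaf network produces only small outputs, so the second-ranked leaf is consulted and accepted on the second attempt.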


Experimental Results

A  Sample  Problem 

A sample problem was devised to test the effectiveness of this algorithm relative to a single back
propagation network. The data was produced by generating random points within fourteen overlapping
spheres. The elements of each of the fourteen classes were randomly divided in half, with one group used
for training and the other for testing. The clustering algorithm was applied to the data and produced four
clusters, each with three or four classes. Note that the classes are not linearly separable. The topology
used for this experiment was a switching network and four leaf networks each with three input nodes, four
hidden nodes and 4 output nodes. The back propagation network consisted of 3 input nodes, a single
hidden layer with 8 nodes, and 14 output nodes. The number of hidden units in the back propagation
network was varied from 6 to 13 with the best classification accuracy produced by a 3-8-14 network. The
accuracy for the back propagation network was 95% correct on the testing data while the hierarchical
network was able to classify 100% of the test vectors correctly. The switching network made the correct
selection on the first try 100% of the time. Training time was reduced by a factor of two without using
parallel training. We report this experiment to demonstrate that problems do exist for which this network
can reduce training time and improve classification accuracy relative to a single back propagation
network.

Character  Recognition 

A more difficult problem is that of multi-font character recognition. Computer generated characters in six
fonts were digitized and a feature extraction mechanism was used to create 156 vectors, each containing
14 elements [4][8]. These were divided into two groups of 78 vectors. One group contained vectors
representing each character in three fonts and was used for training. The other group of 78 vectors
contained the remaining three fonts and was used for testing.

The clustering algorithm produced four clusters each with six or seven classes. The switching network
used was a neural network trained using back propagation and consisted of 14 input nodes, a single
hidden layer with 4 hidden nodes and 4 output nodes. Each of the four leaf networks was also trained
with back propagation and contained 14 input nodes, 6 hidden nodes and 7 output nodes. A single back
propagation network with 14 input nodes, 20 hidden nodes and 26 output nodes was trained. This
topology produced the best results for the multi-font character data set as reported in [8].

Several criteria were used to compare this approach to a single back propagation network. First, consider
the size of the network. A network with 14 input nodes, 20 hidden nodes and 26 output nodes contains 60
nodes and, including bias weights, 846 weights. The number of weights is particularly important since
each must be updated on every iteration. Training was stopped at 700 iterations by the criterion that the
change in the error over a 100 iteration period was smaller than a predefined epsilon. This resulted in a
total of 592,200 weight updates. Since weight updates are the most expensive part of the algorithm, this
is a good measure of relative speed. In contrast, the decision tree neural network with its five networks
contained 130 nodes and 636 weights. However, since each leaf network is assigned a simpler task, fewer
training iterations were required. The average number of iterations for all five networks was 400,
resulting in 254,400 weight updates. In this case, the decision tree neural network required approximately
25% less storage for weights and reduced the number of weight updates to approximately 43% of the
single-network total, a reduction of about 57%.
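
The node and weight counts quoted above can be checked with a few lines of arithmetic. The sketch assumes fully connected layers with one bias weight per non-input node; the helper name mlp_weights is hypothetical.

```python
def mlp_weights(layers):
    """Weights of a fully connected net, counting one bias per non-input node."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

# Single network: 14 inputs, 20 hidden nodes, 26 outputs.
single_weights = mlp_weights([14, 20, 26])
# Decision tree: one 14-4-4 switching network plus four 14-6-7 leaf networks.
tree_weights = mlp_weights([14, 4, 4]) + 4 * mlp_weights([14, 6, 7])
tree_nodes = (14 + 4 + 4) + 4 * (14 + 6 + 7)

single_updates = single_weights * 700   # 700 training iterations
tree_updates = tree_weights * 400       # 400 iterations on average
update_ratio = tree_updates / single_updates
storage_saving = 1 - tree_weights / single_weights
```

Under these assumptions the counts reproduce the 846 and 636 weights and the 592,200 and 254,400 weight updates; the ratio of updates comes out at roughly 0.43 and the storage saving at roughly 25%.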

Timings were also conducted using a Sun Workstation. The single network required 5.39
seconds/iteration to train, or a total of 3773 seconds. Two timings must be considered for the decision tree
network. First, code was written to train the networks on a single processor machine. The total time was
402.8 seconds. However, one of the advantages of this architecture is that all five networks can be trained
in parallel. Thus, a five processor machine, or five processes running on five dedicated workstations, can
produce the weights in approximately 168.4 seconds, or the maximum of the five independent training
times. Thus, training times were reduced by 89% for the single processor implementation and 95% using
parallel training. Note that the number of weight updates is reduced by about 57% while training time is
reduced by 89% even for the sequential implementation. This difference can be attributed to the fact that
the amount of time required for a weight update is dependent upon the size of the network.

Classification accuracy was also measured. The single network had an accuracy of 100% on the training
vectors and 90% on the testing vectors. The multi-stage network also classified 100% of the training
vectors and 90% of the testing vectors correctly. Thus, performance was not affected and the time and
space required to achieve this performance were significantly reduced. Note also that the switching
network was able to identify the correct leaf network on the first attempt 100% of the time for the training
data and 98.7% of the time for the testing data.

Conclusion

This research attacks the problem of pattern classification when the number of classes is large and rapid
training time is necessary. An emphasis was placed on designing a topology which could exploit large
grain parallelism or benefit from a distributed computing environment. The approach is to cluster the
classes, use a neural network as a switch to select the appropriate cluster, and utilize neural networks to
build intracluster separating surfaces. The switching network and the leaf networks can be trained in
parallel, as can instances of the clustering algorithm with different numbers of seed points. Experiments
using this topology have produced classification accuracy comparable to that generated by a monolithic
back propagation network and training times have been consistently reduced. For the multi-font character
data, a speedup of a factor of 20 was achieved including the time required to generate the clusters. An
additional benefit of this structure is the creation of "plug-in" components, that is, individual networks
can be removed and replaced with more effective structures or retrained as additional data becomes
available without affecting the remaining networks in the hierarchy. Not all of our experiments have been
reported here, but our results consistently show accuracy comparable to that achieved by the best
monolithic back propagation networks with significant training time reductions.

Future work will concentrate on finding faster and more effective clustering algorithms as well as making
improvements to the network learning rules. One planned addition to the leaf networks is the
incorporation of the "don't care" training algorithm. In a don't care network, the separating surfaces are
combinations of surfaces which separate pairs of classes rather than the traditional approach of separating
one class from all others. The algorithm is fully described in [9] and has proven very effective in a variety
of applications. As with the structure described above, don't care networks build complicated separating
surfaces from simple components, each of which is easier and quicker to identify than the single
separating surface. Further reductions in training time should be possible with this method.

Abstract

We present a multi-layer feed-forward neural network that has been built up for pattern
recognition purposes. Since we put special emphasis onto the aptitude towards hardware
implementation, we provided it with a modular architecture of partially connected subnets. In
turn, this special architecture makes it possible to avoid random initialization of the weights.

The learning strategy is characterized by the adaptation of the learning rate coupled with an
intermediate "batching" of the error back-propagation. To test the performance of the net in the
classification of hand-written numerals, we fed the input units with grey-shaded patterns
obtained from a large data set of the US National Institute of Standards and Technology (NIST).

The original samples have been subjected only to scaling operations. The results obtained up to now
seem quite interesting. In fact, in this application our recognizer ranks well among the
ten best-rated OCR systems which competed in a world-wide contest organized by NIST itself in
June 1991.

1  Introduction 

Multi-Layer Perceptrons (MLPs) are presently used in a large variety of pattern recognition tasks,
so they really constitute a clear example of general-purpose devices. However, when dealing with
real-world applications like OCR, MLPs are seldom utilized in a plain form: in fact they are
usually tailored to specific requirements, and sometimes they become part of more complex
classifying systems.

This stems from the fact that, although its operating principle is quite straightforward, the
design of an efficient MLP classifier constitutes a complex task, since at least three main issues
are to be considered:

1) Experimental results clearly show that performance cannot be set free from the way
information is encoded into the examples. Data Preprocessing strategy thus assumes major
relevance, in that it can dramatically affect the final result as well as the amount of resources that
suffices to achieve it;

2) Given the application, we would like to determine the optimal Network Architecture by
means of few specifications and design rules. As a matter of fact, such process instead relies upon
heuristic choices;

3) The Learning Procedure usually involves several parameters. Again, their values must be
determined mainly on a trial-and-error basis, especially if advanced techniques like weight decay
are adopted.

It is thence evident that additional hints are needed in order to properly address the design
strategy. In fact, several solutions have been proposed that exploit suggestions often provided by
the application itself. For instance, several works are concerned with prior extraction of relevant
features from the raw data: this is done by means of traditional algorithms, by setting up hybrid
networks, or via highly constrained architectures with several hidden layers that apply to
MLPs some ideas borrowed from the Neocognitron. An alternative interesting approach leads to the
definition of MLP committees in conjunction with data resampling and the generation of synthetic
patterns.

On the other side, we must take into account the existence of conventional techniques that
already proved to be very effective, especially in the OCR field. Compared to them, Artificial
Neural Networks can rely on their own intrinsic, massive parallelism: but this winning
characteristic cannot be exploited by software simulations on serial processors. So we think that it
is required to give priority importance to the hardware feasibility of the proposed solutions. This
concept, while forcing us to cope with stringent constraints, amazingly turned into a guideline
along the formulation of those answers to the above issues that are described in the following
sections.


2  Data  Preprocessing 

For our training experiments we chose a database of hand-written digits collected by the US
National Institute of Standards and Technology (NIST). It consists of 223125 samples stored as
binary images inside matrices of 128*128 pixels each.

We performed only scaling operations on the original data to produce patterns with specified
dimensions and number of grey levels. For this purpose we developed two different algorithms:
the first one forces the character to "touch" all four borders of the output image, while the other
one preserves its original aspect ratio.

At first we utilized such procedures to build two distinct training sets of 16*16 binary patterns.
We then carried out several simulations using two copies of the same MLP to assess the best
alternative solution. Unfortunately, in any case we did not obtain encouraging results. However,
we achieved substantial improvements by averaging the responses of the two nets. This fact
suggested the opportunity of feeding a single MLP with both versions of the same data. In order
to reduce the number of input components without losing information, we decided to use smaller
patterns (8*8) with a higher number of grey levels (64).
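
A possible form of the twofold scaling is sketched below. The two scaling algorithms themselves are not specified in the text, so this sketch assumes block averaging of the binary pixels, centring of the character when the aspect ratio is preserved, and an image containing at least one ink pixel. All function names are hypothetical.

```python
def crop_to_ink(img):
    """Bounding box of the non-zero pixels of a binary image (list of rows)."""
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

def pad_to_square(img):
    """Centre the image on a square canvas so the aspect ratio is preserved."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    out = [[0] * side for _ in range(side)]
    for r in range(h):
        out[top + r][left:left + w] = img[r]
    return out

def downsample(img, size=8, levels=64):
    """Average blocks of pixels and map the mean to 0..levels-1 grey values."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max(r0 + 1, (r + 1) * h // size)
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max(c0 + 1, (c + 1) * w // size)
            block = [img[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(round((levels - 1) * sum(block) / len(block)))
        out.append(row)
    return out

def twofold_features(img):
    """Concatenate the border-touching and aspect-preserving 8*8 versions."""
    ink = crop_to_ink(img)
    grids = (downsample(ink), downsample(pad_to_square(ink)))
    return [v for grid in grids for row in grid for v in row]

# Toy 128*128 image: a solid rectangle 64 pixels tall and 32 pixels wide.
image = [[1 if 32 <= r < 96 and 48 <= c < 80 else 0 for c in range(128)]
         for r in range(128)]
features = twofold_features(image)
```

The resulting vector has 2 * 64 = 128 components, which matches the 128 inputs of the network described in the following sections.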

Fig. 1 shows some examples of patterns that have been subjected to this kind of twofold
preprocessing (only 5 grey levels are displayed).


3  Network  Architecture 


In view of future hardware implementation, we identified three primary requirements to be
followed in the architectural definition of the net:

1) Every neuron must have a limited number of synapses (we imposed a maximum of 32 plus
the threshold);

2) Interconnections must be planned so as to allow an easy routing of the communication lines;

3) No additional constraints specifically related to OCR are to be imposed, so in principle the
same solution can be directly utilized for other classification tasks, or scaled to fit their
requirements.

We  therefore  designed  a  MLP  provided  with  modular  architecture.  The  number  of  modules 
equals  that  of  the  classes  to  be  distinguished.  Every  module  can  be  viewed  as  a  partially  connected 
subnet  with  only  one  output  neuron. 

Fig. 3 emphasizes the general organization of the net: in particular, it can be seen that different
modules do not share any connection. We can take great advantage of this, because we can plan
to physically realize only few modules and then multiplex them. Their actual number can be
settled to allow an efficient pipelining of the preprocessing stage with proper MLP operation.
Moreover, this highly parallelizable structure guarantees low spatial cross-talk among hidden
neurons, thus resulting in a fairly high convergence rate during the training phase.

Fig. 4 shows the inner structure of the single module. The output neuron is completely
connected with the hidden layer of 32 elements. In turn, each hidden unit has access to only 28
input components in a cyclic, sequential fashion: i.e., inputs 1-28 are connected to the first
neuron, inputs 29-56 are connected to the second neuron, ..., inputs 113-128 and 1-12 are
connected to the fifth neuron (that depicted dark in the figure), and so on. These choices take care
of two important features: first, each module covers the input vector an integer number of times,
so that the very same connection scheme is preserved along the network; second, different hidden
neurons in the same module are connected with different subsets of the input vector.
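
The cyclic connection scheme can be written down in a few lines. The sketch below assumes 128 inputs, 10 modules (one per digit class) and one threshold per neuron; the function name hidden_inputs is hypothetical.

```python
N_INPUTS, N_HIDDEN, FAN_IN, N_MODULES = 128, 32, 28, 10

def hidden_inputs(k):
    """1-based indices of the inputs connected to hidden unit k (1-based)."""
    start = (k - 1) * FAN_IN
    return [(start + i) % N_INPUTS + 1 for i in range(FAN_IN)]

# How many hidden units of one module see each input component.
coverage = [0] * N_INPUTS
for k in range(1, N_HIDDEN + 1):
    for i in hidden_inputs(k):
        coverage[i - 1] += 1

# Distinct input subsets among the hidden units of one module.
distinct_subsets = len({tuple(hidden_inputs(k)) for k in range(1, N_HIDDEN + 1)})

# Free parameters: 28 weights plus a threshold per hidden unit,
# 32 weights plus a threshold for the single output neuron.
params_per_module = N_HIDDEN * (FAN_IN + 1) + (N_HIDDEN + 1)
total_params = N_MODULES * params_per_module
```

With these values every input is seen by exactly 7 hidden units of a module (32 * 28 = 7 * 128), all 32 hidden units receive different subsets, and the whole net has 9610 free parameters.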




4  Learning  Procedure 


The special architecture of the net had a profound impact on the learning strategy itself. In fact,
we noted that in this case random initialization of the weights could be avoided. We therefore
started with null values (an entire class of MLPs with logistic neurons and generic number of
hidden layers can be initialized in this way; details are given elsewhere). Such a choice carries some
interesting properties:

1) We get rid of one heuristic parameter, i.e. the maximum absolute value of the initial random
weights;

2) Since neurons are provided with logistic transfer function, they lie in the farthest state from
saturation. In other words, they are maximally sensitive to the error signal.

Of course, it was necessary to avoid the sudden spreading of weight values towards a
substantially random distribution after few updatings. To do this, two solutions appeared, both of
which carry the serious drawback of slowing down the system evolution. That is:

1) Performing by-epoch training;

2) Using low learning rate.

Concerning  the  first  point,  we  found  a  satisfactory  trade-off  by  making  one  update  every  100 
patterns  presented  to  the  net,  thus  realizing  an  intermediate  "batching"  of  the  error  back- 
propagation. 

We then started with low learning rate (0.01), and then changed its value according to the
Vogl adaptive technique. After 100 training epochs, we halved all the weights and kept on with
the same procedure for 50 additional epochs. Although this operation can be considered a very
crude form of weight decay, nevertheless it has already proven to be quite effective.

Figures 4 and 5 show the behaviours of the learning rate and of the Mean Square Error (MSE)
on the outputs vs. the number of training epochs. It should be noted that, as long as the learning
rate increases, MSE tends to saturate until it stops decreasing. When this happens, the learning
rate gets halved, thus allowing narrower zones of the error surface to be explored. As a result,
MSE starts going down again.
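
The learning-rate behaviour described above can be sketched as follows. The starting rate of 0.01 and the halving come from the text; the growth factor of 1.05 is an assumed value, and the sketch leaves out other details of the adaptive technique (for example, whether the offending weight update is discarded). All names are hypothetical.

```python
def adapt_rate(rate, prev_mse, mse, grow=1.05, shrink=0.5):
    """Raise the rate while the error keeps falling, halve it otherwise."""
    return rate * grow if mse < prev_mse else rate * shrink

def rate_schedule(mse_per_epoch, rate=0.01):
    """Learning rate in force at each epoch, given the observed MSE values."""
    rates = [rate]
    for prev_mse, mse in zip(mse_per_epoch, mse_per_epoch[1:]):
        rate = adapt_rate(rate, prev_mse, mse)
        rates.append(rate)
    return rates

def halve_weights(weights):
    """Crude weight decay applied once, after the first 100 epochs."""
    return [w / 2.0 for w in weights]

# Illustrative MSE values only; they are not taken from the experiments.
rates = rate_schedule([1.0, 0.8, 0.7, 0.7, 0.6])
```

In this toy run the rate grows for two epochs, is halved when the error stalls at 0.7, and starts growing again once the error resumes its descent.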


[Figure: Learning Rate vs. Number of Training Epochs (x-axis: Epoch)]

[Figure: MSE vs. Number of Training Epochs]

5 Performance Evaluation


In June 1992, NIST organized a world-wide contest with the purpose of evaluating the state of
the art in the OCR field. For what concerns hand-written numerals, NIST provided a suggested
training set (the one we previously described) and a test set of 58646 samples purposely taken
from a very different population. Therefore, recognition performance on the latter database is very
revealing about the generalization capability of the system. We preprocessed such samples in the
same way we did for their training counterparts (except for the fact that in this case we used only
16 grey levels), and then we tried to classify them by means of our modular MLP.

In particular, we were mainly interested in checking the behaviour of the net when constraints
on the resolution of both memory and computing elements are applied. We then quantized all
the weights using 6 bits, and the transfer function of the hidden neurons using 4 bits. Here output
neurons are not involved. In fact, since their transfer function is monotonic, we can determine the
winning class directly on their activations. In analog implementations these quantities are usually
expressed in terms of currents, and very simple circuits can be designed to select the highest
one.
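
A minimal sketch of such a quantization is given below. The text does not say how the 6-bit and 4-bit levels are placed, so the sketch assumes uniform levels, a symmetric weight range [-w_max, w_max] and a logistic output range [0, 1]. All names are hypothetical.

```python
import math

def quantize(value, bits, lo, hi):
    """Round value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    steps = 2 ** bits - 1
    clipped = min(max(value, lo), hi)
    return lo + round((clipped - lo) / (hi - lo) * steps) * (hi - lo) / steps

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_output(weights, inputs, bias, w_max=1.0):
    """Hidden neuron with 6-bit weights and a 4-bit transfer function."""
    qw = [quantize(w, 6, -w_max, w_max) for w in weights]
    qb = quantize(bias, 6, -w_max, w_max)
    activation = sum(w * x for w, x in zip(qw, inputs)) + qb
    return quantize(logistic(activation), 4, 0.0, 1.0)

def winning_class(output_activations):
    """A monotonic transfer function preserves order, so compare activations."""
    return max(range(len(output_activations)), key=output_activations.__getitem__)

levels_4bit = {quantize(i / 1000, 4, 0.0, 1.0) for i in range(1001)}
y = hidden_output([0.3, -0.2], [1.0, 1.0], 0.1)
```

Because the winner is chosen directly from the output activations, the output transfer function never needs to be evaluated or quantized.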

With this configuration, when we forced the net to take a decision anyway, we achieved 3.69%
error rate on the NIST Test Set. It is worth noting that performance worsening is very limited in
comparison with the usage of floating-point weights and neurons, since it amounts to about 0.1%.
This stems from the fact that weight values result more uniformly distributed once decay is
applied and additional training epochs performed.

We then rejected the most dubious cases by imposing lower thresholds onto the winning
outcome: clearly this is not the most effective solution, since the amount of information provided
by the net is not fully taken into account. However, it has been chosen for its simplicity. Fig. 6
summarizes the results we obtained: dots in the graph show the behaviour of the error rate with
regard to the percentage of rejected samples.
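
The rejection rule can be sketched as below. It is assumed here that the error rate is measured over the accepted samples only, which the text does not state; the function name and the toy data are hypothetical.

```python
def error_vs_rejection(outputs, labels, threshold):
    """Return (percentage rejected, percentage of errors among accepted samples)."""
    accepted = errors = 0
    for out, label in zip(outputs, labels):
        winner = max(range(len(out)), key=out.__getitem__)
        if out[winner] < threshold:
            continue  # winning outcome too weak: the sample is rejected
        accepted += 1
        errors += winner != label
    reject_pct = 100.0 * (len(labels) - accepted) / len(labels)
    error_pct = 100.0 * errors / accepted if accepted else 0.0
    return reject_pct, error_pct

# Toy outputs for four samples and two classes (illustrative values only).
outputs = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45], [0.2, 0.8]]
labels = [0, 0, 0, 1]
forced = error_vs_rejection(outputs, labels, threshold=0.0)
strict = error_vs_rejection(outputs, labels, threshold=0.7)
```

Sweeping the threshold and plotting the second value against the first gives a curve of the kind the figure summarizes.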
Fig. 6


6  Conclusions 

We showed how a large, real-world task like the recognition of hand-written numerals may be
quickly and economically accomplished by means of a rather general-purpose MLP. In fact, our
classifier correctly recognizes more than 96.3% of the samples contained in the NIST Test Set.
Therefore, it ranks sixth in the corresponding graduated list even in the presence of the constraints
we imposed for hardware feasibility purposes.

We want to point out here the plainness of the solutions that allowed us to achieve this result:
no complex features are extracted from the raw data, no "a priori" knowledge about the problem
to be solved is used in the architectural definition of the net. Moreover, the system has a total
amount of 9610 free parameters: so it is about as large as a completely connected MLP with 128
inputs, one hidden layer of only 69 units, 10 outputs.

Abstract

A novel method for recognizing unconstrained handwritten numerals based on the self-organization
of local feature maps is presented. The proposed network is hierarchically configured with three blocks
of layers. The bottom layer has local feature maps that represent distinct shapes and locations of local
features of individual numerals. Those feature maps are self-organized into groups to represent the
fundamental composition of individual numerals. The middle layer has maximum selection networks,
which generate an output from each feature group as the matching score of a local feature that has
maximum matching with the given input sample. Finally, the top layer is a backpropagation network
for making a final decision based on the outputs of individual feature groups of each numeral. The
proposed network achieves robustness to translation, rotation, and scaling by defining areas of feasible
feature locations in the feature maps defined in both the Cartesian and Polar coordinates. Distortion
is handled by a number of representative shapes generated by self-organization of each feature group.
The self-organization of feature maps is accomplished by automatically recruiting local features and by
describing their correlations from training samples based on the evaluation of network performance. The
experimentation with the CEDAR data base demonstrates that the proposed method is superior to the
existing benchmark results [1, 2].


1  Introduction 

Handwritten  zip  codes  and  character  data  manifest  that  real  data  are  subject  to  large  amounts  of  distortion, 
scaling,  and  rotation.  But,  many  of  the  previous  approaches  developed  for  off-line  handwritten  character 
recognition  can  only  provide  a  partial  solution  to  those  real-world  problems.  Thus,  it  is  essential  to  develop  a 
recognition  system  robust  to  various  forms  of  deformations  present  in  real  data,  yet  computationally  efficient 
for  real-time  applications. 

As  a  means  of  achieving  the  above  goal,  we  have  proposed  a  Dual  Cooperative  Neural  Network  (DCN)  in 
which  a  Cartesian  Network  (CN)  and  a  Log-Polar  Network  (LPN)  cooperatively  determine  the  pattern  class 
[3].  DCN  is  intended  to  combine  the  strengths  of  the  Cartesian  and  polar  coordinate  data  representations: 
for instance, rotated and/or scaled input patterns in Cartesian coordinates appear in polar representation
as horizontally or vertically shifted patterns, which can be easily detected by nearby horizontal or vertical
shift invariant feature detecting cells in polar coordinate feature maps. Furthermore, the discrimination
power  of  DCN  is  increased  by  the  two  sets  of  local  feature  maps  that  are  selected  independently  as  the  most 
salient  geometric  features  in  the  Cartesian  and  polar  coordinates,  respectively.  The  proposed  DCN  has  been 
shown  effective  in  handling  handwritten  numeric  patterns  corrupted  by  translation,  rotation,  scaling,  and 
distortion,  as  demonstrated  in  [3]. 

In this paper, a new network architecture is proposed for DCN with the emphasis on the self-organization
of local feature maps. The proposed network architecture has a hierarchical structure with three blocks of
layers: a local feature map layer, a maximum selection/correlation layer, and a decision layer with back-
propagation networks (BPNs). In network learning, not only the weights of BPNs in the decision layer are

Figure 1: Proposed Network Architecture


iteratively  updated,  but  also  the  features  of  the  local  feature  layer  are  automatically  recruited  for  each 
feature  group  whenever  necessary  for  improving  the  network  performance.  The  proposed  self-organization 
of feature maps avoids the difficulties of selecting shape features in the Neocognitron approach [4], and of
determining the number of required hidden units or local features in the backpropagation based network
approaches [1, 2]. The network achieves the robustness to translation, rotation, and scaling variations based
on the tolerance of feasible feature locations in the feature maps defined in the Cartesian and polar
coordinates, whereas the robustness to distortion and thickness variations comes from a variety of representative
feature  shapes  defined  in  the  maps  by  self-organization.  The  proposed  network  has  been  successfully  tested 
with  the  standard  CEDAR  data  base. 

2  Network  Architecture 

Fig.  1  illustrates  the  proposed  network  architecture.  At  the  bottom  layer  are  the  local  feature  maps.  The 
local feature maps are to represent distinct shapes and locations of local features of individual numerals. To
represent the fundamental composition of individual numerals, local feature maps are organized into groups.
For instance, there are 5 feature groups defined for the numeral 2 in CN, and 4 feature groups in LPN.
For the entire set of numerals, 47 feature groups are defined in CN and 41 in LPN. Each group provides a single
matching  score  to  be  used  for  the  final  decision. 

A feature map, composed of a shape map and a position map, represents a cluster of features similar in
shape and location. The shape map specifies the representative shape of the feature map, and the position

Figure 2: Alternative Network Architecture

map defines the tolerance in position based on the weights assigned to each cell of the feature map. The
network recruits feature maps for each group based on self-organization.

At the middle layer is the maximum selection/correlation network. It simply generates an output from
each feature group as the maximum matching score with the given input sample, or, as the matching score
of a feature map that provides the maximum correlation with the given input sample in a global context, as
a more sophisticated matching scheme. At the top layer are BPNs for the final decision based on the outputs
of individual feature groups of each numeral. A BPN can be defined for CN and LPN either individually or
as a whole.
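
The two middle-layer schemes can be sketched as follows. The sketch assumes that each feature group supplies a list of matching scores, one per feature map, and that a correlation node is a tuple naming one feature map per group. Both representations and the function names are hypothetical.

```python
def max_selection(group_scores):
    """One output per feature group: its highest matching score."""
    return [max(scores) for scores in group_scores]

def max_correlation(group_scores, nodes):
    """Pick the correlation node with the largest summed score and
    output the matching scores of the feature maps it connects."""
    def node_total(node):
        return sum(group_scores[g][m] for g, m in enumerate(node))
    best = max(nodes, key=node_total)
    return [group_scores[g][m] for g, m in enumerate(best)]

# Two feature groups with two feature maps each (illustrative scores only).
group_scores = [[0.2, 0.9], [0.8, 0.3]]
nodes = [(0, 0), (1, 1)]  # each node links one map from every group
selected = max_selection(group_scores)
correlated = max_correlation(group_scores, nodes)
```

In this toy case maximum selection picks the best map of each group independently, while maximum correlation keeps the scores of the jointly best combination, which here lowers the second group's output.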

Fig. 1 illustrates a network architecture with three layers, where the maximum selection network is used
for the middle layer, and single layer BPNs are used for generating outputs of CN and LPN, respectively.
The outputs of CN and LPN are merged at the top for a final decision under the consideration of the rank of
individual numerals. This network architecture emphasizes simplicity so that the network can show a high
degree of generalization.

Fig. 2 illustrates an alternative, but more sophisticated, network architecture. First, it uses maximum
correlation networks, instead of maximum selection networks, at the middle layer. The maximum correlation
network maintains the correlation connections of feature maps among individual groups of a numeral, and
generates an output from each group that provides the maximum correlation with the input. The correlation
among feature maps for numeral 2 in CN is shown in Fig. 2, where each correlation node, MQj, connects
the feature maps of different groups to represent the contextual information among feature maps. Each
correlation node makes a sum of the matching scores of its feature maps, and the matching scores of the
features corresponding to the correlation node which scores maximum are selected for the input to the
next layer. The correlation nodes are also self-organized. In addition, at the top layer, a 2-layer BPN is used
for making a final decision.

Figure 3: (a) Overall Learning Scheme (b) Initial Feature Map Generation


3 Network Self-Organization

The  overall  learning  scheme  is  shown  in  Fig.  3(a).  Initial  feature  maps  are  self-organized  from  the  local 
features selected by the operator. Then, based on the initial feature maps, the weights, WC and WP, of
BPNs  are  trained  by  the  backpropagation  learning  to  reduce  the  error.  The  weight  training  continues  as  long 
as  the  performance  of  the  network  shows  any  improvement.  When  the  network  performance  is  saturated, 
either  we  generate  new  feature  maps,  or  reinforce  the  existing  feature  maps  based  on  the  rejected  and 
incorrect  samples.  This  process  is  iterated  until  the  network  performance  reaches  a  desired  value.  In  what 
follows,  we  describe  the  above  procedure  in  more  detail. 


3.1  Initial  Feature  Map  Generation 

In  this  stage,  initial  feature  maps  are  self-organized  in  each  feature  group  of  the  numeral  from  teaching 
samples  and  their  features  selected  by  an  operator.  Fig.  3(b)  shows  the  initial  feature  map  generation.  Some 
examples of teaching feature selection are shown in Fig. 7, and their generated features for numeral 2 are
shown  in  the  first  row  of  Fig.  5. 

Initially, there exists no feature map in each feature group. The number of feature maps in each feature
group is increased by feature comparison. When a teaching sample is given, the first teaching feature, $I$,
is compared to the existing reference features of the corresponding feature group. Feature comparison is
done by measuring the similarity (SM) between the input and the reference features. The SM is defined
by SM = 1 - DM, where DM is a disparity measurement with normalized sizes. If a similar feature does not
exist in the corresponding feature group, that is, the SM is less than the given threshold (in our experiment,
the threshold is set at 0.8), the input feature is recruited as a new reference feature, $R_k$. Thus, a new feature
map is generated by copying $R_k = I$ and by initializing the corresponding position weight, $W^k_{ij}$, of the
$(i, j)$th cell of the $k$th feature map:


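$$W^{k,\mathrm{new}}_{ij} = \begin{cases} \dfrac{1}{m} & \text{if } W^{k,\mathrm{old}}_{ij} = 0 \\[2ex] W^{k,\mathrm{old}}_{ij} + \dfrac{1}{n}\left(\dfrac{1}{m} - W^{k,\mathrm{old}}_{ij}\right) & \text{otherwise} \end{cases} \qquad (1)$$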

where $m$ is the number of the teaching features for each numeral, and $n$ is the number of updates of
the corresponding weight. Also, the position weights of the nearby cells, $W^k_{pq}$, are initialized through
$G(p, q)$, a function that decreases slightly with the cell position and has its highest value at the
$(i, j)$th position. These neighboring weights are initialized together to tolerate positional errors in CN and to
tolerate scale and rotational errors in LPN.

Then, the next input feature is compared to the existing features of the corresponding feature group. If
there exist similar features, the most similar feature is modified to accommodate this new input feature by
updating the feature map:

$$R_k^{\mathrm{new}} = R_k^{\mathrm{old}} + \frac{1}{n}\left(I - R_k^{\mathrm{old}}\right), \qquad (2)$$

where $n$ is the number of updates for that reference feature. Thus, the local features represented in the feature
maps have blurred shapes as a result of generalizing over the accommodated samples. This provides
the local feature maps with some capability of handling local distortions. Also, the position weights of
corresponding and nearby cells are reinforced. These procedures are repeated until there are no more
teaching samples selected by the operator.
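
The recruit-or-accommodate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: features are taken to be equal-length lists of values in [0, 1], the disparity DM is assumed to be the mean absolute difference (the text only says it is a disparity measurement with normalized sizes), and the names `similarity` and `present` are hypothetical. Position weights are left out. The 0.8 threshold is the value reported in the text.

```python
THRESHOLD = 0.8  # similarity threshold reported in the text


def similarity(a, b):
    """SM = 1 - DM, with DM assumed to be the mean absolute difference."""
    dm = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 - dm


def present(group, feature):
    """Recruit `feature` as a new reference, or blend it into the most
    similar existing reference by a running average.

    `group` is a list of dicts {"ref": [...], "n": update_count}.
    Returns the index of the reference that was created or updated.
    """
    best, best_sm = None, -1.0
    for idx, entry in enumerate(group):
        sm = similarity(entry["ref"], feature)
        if sm > best_sm:
            best, best_sm = idx, sm
    if best is None or best_sm < THRESHOLD:
        # No sufficiently similar reference: recruit a new feature map.
        group.append({"ref": list(feature), "n": 1})
        return len(group) - 1
    # Accommodate: R_new = R_old + (I - R_old) / n (assumed counting of n).
    entry = group[best]
    entry["n"] += 1
    entry["ref"] = [r + (x - r) / entry["n"] for r, x in zip(entry["ref"], feature)]
    return best
```

With this counting convention, the second sample accommodated by a reference gives the plain average of the two, which is one way to obtain the blurred shapes the text describes.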

3.2 Weight Training

Since  the  initial  feature  maps  are  formed  by  the  local  feature  information  of  the  given  teaching  patterns, 
the  network  needs  to  accommodate  global  information  of  teaching  patterns  by  training  WC  and  WP  for  the 
different  contribution  of  each  feature  group  in  each  numeral. 

The weights are trained by backpropagation learning [5] to reduce the network error, $E_p$, defined by

$$E_p = \frac{1}{2} \sum_i \left(t_{pi} - o_{pi}\right)^2. \qquad (3)$$

Here, $t_{pi}$ is the desired output value (1 for true, -1 for others) and $o_{pi}$ is the actual output value for the $i$th
output unit of the $p$th teaching pattern with a sigmoidal activation function. The overall network error,
$E_{rms}$, is defined by


where $M$ is the total number of classes, and $N$ is the total number of training patterns. Weights are updated
by

$$\Delta w_{ij} = -\eta \frac{\partial E_p}{\partial w_{ij}} = \eta \left(1 - o_{pi}^2\right)\left(t_{pi} - o_{pi}\right) MC_{ij}, \qquad (5)$$

where $\eta$ is a learning coefficient (0.01 in our experiment), and $MC_{ij}$ (or $MP_{ij}$) is a maximum matching score
of each feature group.
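
A per-pattern update of this kind for a single-layer BPN might look like the sketch below. It is only an illustration under assumptions: the activation is taken to be tanh (consistent with the -1/+1 targets and a derivative factor of the form 1 - o^2), and the function name `train_step` and the weight layout are hypothetical. The learning coefficient 0.01 is the value reported in the text.

```python
import math

ETA = 0.01  # learning coefficient reported in the text


def train_step(weights, scores, targets, eta=ETA):
    """Apply one per-pattern update in place and return the error E_p.

    weights[i][j]: weight from feature group j to output unit i.
    scores[j]:     maximum matching score of feature group j.
    targets[i]:    +1 for the true class, -1 otherwise.
    """
    error = 0.0
    for w_i, t in zip(weights, targets):
        # Assumed activation: tanh of the weighted sum of matching scores.
        o = math.tanh(sum(w * s for w, s in zip(w_i, scores)))
        delta = (1.0 - o * o) * (t - o)
        for j, s in enumerate(scores):
            w_i[j] += eta * delta * s
        error += 0.5 * (t - o) ** 2
    return error
```

Calling `train_step` once per pattern, in a fixed order, corresponds to updating after each single presentation rather than accumulating a gradient over the whole training set.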

During  the  experiment,  the  patterns  were  repeatedly  presented  in  a  constant  order.  The  weights  were 
updated after each presentation of a single pattern rather than updated by a true gradient procedure
(averaging over the whole training set before updating the weights), due to a large redundancy in the data
base  [1]. 

3.3  Self-Organization  of  Feature  Maps 

After each iteration of weight updates, the error rate and $E_{rms}$ were measured. If the error rate was
improved,  weight  updates  were  continued.  But,  when  there  is  no  more  improvement  in  error  rate  with 
iterative  weight  training,  the  network  may  need  new  shape  features  from  the  rejected  and  incorrect  training 
samples.  For  the  selection  of  optimal  or  near  optimal  features,  a  self-organization  of  local  feature  sets  for 
individual  characters,  as  a  part  of  supervised  network  learning  process,  is  needed.  Thus,  new  shape  features 
are  captured  and  added  to  the  network,  and  similar  features  are  accommodated  to  the  existing  features  of 
the  network  from  the  training  samples. 

Fig. 4 shows examples of the feature map generation/modification. Initially, only the first 2, which was
selected by the operator, was registered in the network as a reference sample. Assuming that the second 2 is
given to the network and the network result is unclear (a candidate for rejection) or incorrect, new feature
maps can be generated from this sample. First, the output response of each feature group is examined.
To select the position of a new feature in each feature group, the position with the highest matching score
among the currently registered feature maps is selected. A new feature is then captured from the rejected
or incorrect pattern according to that position information. Then, the shape of the captured feature is
compared to the existing features in the same feature group. If there exists a similar feature, the shape and
positions (weights) are accommodated by averaging. If not, this feature is registered as a new feature in that
feature group by creating a new feature map (shape and positions). Thus, for the second 2, one feature is
blurred into the existing similar feature, and four features are generated and registered in each feature group.
For the third 2, only one feature is generated, and four other features are accommodated to the existing
features. This process is continued until all the rejected and incorrect patterns are examined.

Figure 6: Unconstrained Handwritten Numerals from Zip Codes (Examples of correctly recognized test
patterns)

Fig. 5 shows the generated reference features for numeral 2 in CN with 200 training patterns. Features
in the first row show the generated reference features in each feature group from the operator-given teaching
features. G1 indicates the first generation of new features after the 30th iteration of weight updates, and
G2 and G3 show the next generations of new features.

4  Experiment 

In this experiment, the proposed network with maximum selection network and single layer BPNs in each CN
and LPN was implemented and tested. Weights, WC and WP, are fully connected with 10 output units, and
the corresponding outputs of the networks are merged by adding output score and rank together. When an
input pattern is given, it is size normalized to 58x58 for CN and log-polar transformed to 23x70 for LPN
(instead of 23x60 to handle boundary portions of the transformed input).

4.1  Data  Base 

Real handwritten zip code data provided by CEDAR were used for the experiment. These are binary
handwritten digits segmented from zip codes with a resolution of 300 ppi (12 pixels/mm). Some examples
are shown in Fig. 6. It can be seen that the data usually contain distortion, scaling, thickness variations,
rotation, translation, noise, etc. From this data base, the initial 2000 patterns (200 for each numeral) were
used for training, and 2213 patterns* were used for testing in this experiment.

* CEDAR recommended test set

Figure 8: Network Performance: with rejection when the difference of the two strongest outputs is within 5%.


Table 2: Various Test Results for 2213 Test Patterns
(each entry: % correct, % rejected, % error)

End of             Initial             G1                  G2                  G3
w/o rejection      95.2   0.0   4.8    96.3   0.0   3.7    96.4   0.0   3.6    96.7   0.0   3.3
rejection w/ 5%    93.9   3.3   2.8    95.2   2.0   2.8    95.3   2.2   2.6    95.4   2.3   2.3
2.0% error         92.9   5.1   2.0    93.5   4.5   2.0    94.8   3.3   2.0    95.0   3.0   2.0
1.0% error         87.3  11.7   1.0    90.6   8.4   1.0    92.5   6.5   1.0    93.0   6.0   1.0

4.2  Results 

Initial networks were formed with about 20 to 30 training samples of each numeral in each network. Examples
of operator selection of teaching features are shown in Fig. 7. The number of selected features differs
depending on the numeral. 47 features were selected in CN and 41 in LPN. Table 1 shows the number of
generated reference features in each feature group of the numeral with 2000 training patterns. At G1, many
new features were generated, but at G2 and G3, only a small number of new features were generated and added.

The network performance is shown in Fig. 8. Each iteration involves presentation of 2000 training samples.
In Fig. 8(a), DCN (solid line) represents merged results of CN (broken line) and LPN (dotted line). From
the initial networks, after 25 iterations of weight updates, the network performance was almost saturated,
that is, no more improvement in accuracy was obtained. Thus, at G1, new features were generated from
the rejected and incorrect patterns by the network self-organization. By recruiting new necessary features,
the network performance was immediately improved for both training and testing patterns. Fig. 8(b) shows
the rejected and incorrect training patterns as learning progresses. Fig. 8(c) and (d) show the results with
2213 test patterns. We can see the effectiveness of the dual cooperation of networks by the performance
improvement, particularly in the testing patterns. Test results show 95.4% accuracy with 2.3% rejection
and 2.3% error for test patterns, and 99.8% accuracy with 0.2% rejection for training patterns.

For a practical and accurate evaluation of the proposed network, and for a fair comparison with [1],
we have also measured the accuracy and rejection rate at 1% and 2% error on the test patterns. Those results
are shown in Table 2. We can see the robustness of the proposed network particularly in the 2% and 1% error
measurements. At the end of G3, our system rejected 3.0% for a 2.0% error and 6.0% for a 1.0% error, which
is about 3% better than [1] in recognition rate. Fig. 9 shows incorrectly recognized test patterns for the 1%
error (Fig. 9(a)), for 2% error (Fig. 9(a-b)), and for rejection within 5% (Fig. 9(a-c)). Some of the patterns
were incorrectly recognized mainly due to their large size.
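
The "rejection within 5%" criterion can be sketched as a margin test on the two strongest outputs. This is only an illustration under an assumption: the 5% is read here as an absolute difference of 0.05 between output scores, which the text does not spell out. The function name `classify_with_rejection` is hypothetical.

```python
def classify_with_rejection(outputs, margin=0.05):
    """Return the winning class index, or None when the pattern is rejected.

    A pattern is rejected when the two strongest outputs differ by no more
    than `margin` (assumed to be an absolute difference).
    """
    ranked = sorted(range(len(outputs)), key=lambda k: outputs[k], reverse=True)
    top, second = ranked[0], ranked[1]
    if outputs[top] - outputs[second] <= margin:
        return None  # too close to call: reject
    return top
```

Under this reading, a larger margin rejects more patterns and leaves fewer errors among the accepted ones, which is the trade-off that Table 2 reports at fixed error rates.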

With this experiment, we can conclude that the network performance can be greatly improved 1) by
combining the iterative weight training of BPNs and the automatic recruitment of local feature maps in each
feature group, and 2) by combining two independent, cooperative networks, CN and LPN. In contrast to
other works [1, 2, 4], our proposed network with a new learning scheme was tested with totally unconstrained
handwritten numerals without any complex rotation and/or thickness normalization steps and without any
noise removal process.

Figure 9: Incorrectly Recognized Test Patterns; (a) 1% error, (a-b) 2% error, and (a-c) rejection w/ 5%

5  Conclusion 

In this paper, a new hierarchical network structure based on self-organization of feature maps to recognize
unconstrained handwritten numerals has been discussed, and very promising results were obtained. Our
approach can give robustness to various deformations such as translation, rotation, scaling, and even distortion
and thickness variation. Complete experiments that include the maximum correlation layer and add
more training patterns are in progress.

Abstract: Classification methods often perform significantly below Bayesian limits in complex,
high-dimensional classification tasks because of model bias, inadequate training data and
noise/variability in the data. When several classifiers are used for a given task, selecting one
method over all others discards potentially valuable information. Strategies aimed at suitably
combining the results of multiple classifiers are expected to perform better than any single
method, and reduce overall bias and noise. An underwater passive sonar data set consisting
of over 1000 samples processed to produce different 25-dimensional and 24-dimensional feature
vectors is used in this study to examine an evidence combination framework. An analysis of the
conditions that the data sets must satisfy, and the conditions under which improvements can be
obtained is provided, and results are presented for hybrid networks using both local and global
classifiers.

1  Introduction 

Supervised  feed-forward  neural  networks  have  been  applied  to  numerous  classification  problems 
in  signal  processing  and  pattern  recognition  [5].  These  include  the  Multi-Layer  Perceptron 
(MLP) employing sigmoidal "hidden units," as well as kernel-based classifiers such as those
using Radial or Elliptical Basis Functions (EBFs) [2, 6]. Such networks can serve as
non-parametric, adaptive classifiers that learn through examples [5], without requiring a good a
priori  mathematical  model  for  the  underlying  signal  characteristics. 

Finding  the  best-suited  network  and  the  optimal  selection  of  features  for  classification  is  not 
generally  possible  beforehand.  Concatenating  all  types  of  signal  descriptors  into  a  single  input 
vector  is  undesirable  for  several  reasons.  First,  a  large  input  layer  may  lengthen  the  training 
time and complicate parameter selection. Second, mixing conceptually different features may
decrease  the  relative  importance  of  the  most  discriminating  features.  Therefore,  it  may  be 
beneficial  to  train  separate  networks  on  distinct  data  sets  obtained  from  the  same  physical 
signal  by  using  qualitatively  different  feature  extractors.  Similarly,  different  types  of  FFNs  have 
different  characteristics.  For  example,  EBFs  are  more  locally  tuned  than  MLPs.  Furthermore 
each  network  introduces  some  bias,  and  combining  different  networks  can  reduce  the  bias  and 
make  the  classifier  more  robust  [7]. 

In  the  context  of  supervised  feedforward  networks,  interpretation  of  network  outputs  as 
Bayesian  a  posteriori  probabilities  [8]  provides  a  sound  basis  for  combining  the  results  from 
multiple classifiers to yield more accurate classification [1, 3, 4]. The concept of stacked
generalization, an inductive approach to combining generalizers, has been recently introduced by
Wolpert  [9].  A  framework  for  hybrid  neural  networks  in  regression  estimates  was  discussed  in 


In  this  paper,  we  focus  on  the  statistical  aspects  of  such  combining  methods.  Errors  made  by 
different  classifiers  are  labeled  and  used  in  estimating  potential  improvements.  First  we  provide 
a  motivating  example  and  analyze  the  properties  of  both  correctly  and  incorrectly  classified 
patterns.  Then,  we  present  a  framework  in  which  combination  results  can  be  studied.  Finally, 
we  show  how  the  theory  can  be  applied  to  the  data  set  of  Section  2. 


2  A  Motivating  Real  Life  Example 


In  this  section  we  motivate  the  combination  framework  by  studying  a  real  life  classification 
problem,  where  the  objective  is  to  correctly  identify  different  underwater  acoustic  signals.  Since 
neither  the  important  features  of  the  data,  nor  the  best  classifier  to  process  them  is  known 
beforehand, two classifiers and two feature sets are used in the study.


Table 1: Data Description.

                                    Feature Set 1               Feature Set 2
Class         Description           Training      Testing       Training      Testing
[illegible]   Porpoise whistle      [illegible]   284           142           284
[illegible]   Ice cracking          [illegible]   175           175           175
[illegible]   Whale cry 1           116           129           129           129
[illegible]   Whale cry 2           148           235           118           235
              Total                 496           823           564           823

The first classifier (C1) is a fully connected MLP with a single hidden layer consisting of 50
units, and the second classifier (C2) is an EBF with 50 kernels. In the first feature set (FS1), each
sample is represented by a 25-dimensional vector comprising 16 Gabor wavelet coefficients,
8 other temporal descriptors and 1 signal duration indicator [1]. The second feature set (FS2)
contains 24-dimensional vectors, each comprising 10 reflection coefficients of energy segments
obtained through short time windows, 10 reflection coefficients computed over the entire window,
3 temporal descriptors, and 1 signal duration indicator. Table 1 shows the number of classes,
and the number of training and test samples available for each feature set. The test patterns in
each feature set represent the same raw data.


Table 2: Results of Individual Classifiers.

Classifier/      % Correctly     Standard      95% Confidence
Feature set      Classified      Deviation     Interval
C1/FS1           92.66           .63           92.21-93.11
C1/FS2           88.60           .73           88.08-89.12
C2/FS1           91.30           .64           90.84-91.76
C2/FS2           82.02           2.22          80.43-83.61

Table 2 provides the classification results for each individual classifier/feature set pair. The
best performance is achieved by C1 using FS1. A naive approach is to select this combination
and ignore the other three. It must be remembered, however, that since
each pattern is assigned to the class whose output unit has the largest activation value, valuable
information may be discarded during this "max" selection step. A closer look reveals
that there are significant fluctuations in the activation values of the winning classes, depending
on the classifier used and on whether the class chosen was in fact the correct one or not. Since
these activation values approximate a posteriori probability distributions [8], they can be used to
combine information in different ways. Two such combiners will be examined in this study. The
first combiner (AVE) will average the output activities of the sources and select the maximum.
The second combiner (MAX) will assign the pattern to the class whose output has the largest
activation among both sets of output vectors. In the following section we provide a framework
in which the combination results can be interpreted and studied.
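
The two combiners can be written down directly from their descriptions. The sketch below is illustrative only: the function names are hypothetical, and each source is assumed to supply one output activation per class, in the same class order.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def ave_combiner(out_a, out_b):
    """AVE: average the two output vectors, then pick the largest entry."""
    return argmax([(a + b) / 2.0 for a, b in zip(out_a, out_b)])


def max_combiner(out_a, out_b):
    """MAX: pick the class holding the single largest activation
    found in either output vector."""
    return argmax([max(a, b) for a, b in zip(out_a, out_b)])


# A two-class case where the combiners disagree.
a, b = [0.8, 0.7], [0.1, 0.6]
ave_choice = ave_combiner(a, b)  # averages are [0.45, 0.65]
max_choice = max_combiner(a, b)  # per-class maxima are [0.8, 0.7]
```

In this small case AVE favors the class both sources rate moderately well, while MAX follows the single most confident output, which is why the two combiners can give different results on the same pair of sources.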


3  Combining  Framework 


In  this  section  we  formalize  the  conditions  that  are  necessary  for  a  combiner  to  improve  the 
classification  rate  of  different  classifiers.  Given  any  data  set,  the  first  stage  in  classification  tasks 
is  to  extract  the  features  that  will  be  used  as  inputs  to  the  classifier.  Different  feature  sets 
will capture different aspects of the data. Similarly different networks will emphasize different
properties  of  the  input  space.  If  a  combiner  is  to  improve  the  results,  it  must  utilize  all  the 
pieces  of  information  that  are  available  through  the  individual  classifiers.  This  section  develops 
a  framework  highlighting  the  conditions  under  which  such  improvements  can  be  expected. 

Let $D$ be a data set, and let $D_{tr}$ and $D_{tst}$ be a partition that represents the training and
test sets respectively. A network $f$ assigns a pattern $x \in D_{tst}$ to class $C_i$ if the output unit
$f(x)_i$ has the largest activation value among all output units. Let $d(\cdot, \cdot)$ be a distance metric in
the pattern space. An $\epsilon$-neighborhood $N(x; \epsilon)$ of $x$ is the set of all points $y$ such that $d(x, y) < \epsilon$.
A deleted $\epsilon$-neighborhood $N^*(x; \epsilon)$ of $x$ is $N(x; \epsilon) - \{x\}$, i.e., the point $x$ is removed.

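Definition 1 A data set $D$, partitioned into $n$ classes $C_1, \ldots, C_n$, is consistent if $\forall x \in D$,
$\exists$ a deleted $\epsilon$-neighborhood $N^*(x; \epsilon)$ of $x$ s.t. $\forall y \in N^*(x; \epsilon)$, $x \in C_i \Rightarrow y \in C_i$. Furthermore,
$N^*(x; \epsilon) \cap D_{tr} \neq \emptyset$ and $N^*(x; \epsilon) \cap D_{tst} \neq \emptyset$.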

A  data  set  is  consistent  if  there  are  no  point-classes,  i.e.  classes  consisting  of  single  isolated 
points.  It  is  important  to  note  that  consistency  does  not  require  classes  to  be  contiguous,  just 
that  two  points  sufficiently  close  belong  to  the  same  class,  and  that  the  training  and  test  sets 
represent  the  data  equally  well. 

Definition 2 Let $E[f(x)_i]$ and $E[f(z)_i]$ represent the average activations of the outputs
corresponding to the correct class, computed over $N(x; \epsilon)$ and $N(z; \epsilon')$ respectively. A data set $D$ is
balanced with respect to a given function $f$, if $E[f(x)_i] = E[f(z)_i]$, for any two arbitrary
$N(x; \epsilon)$ and $N(z; \epsilon')$.

A  data  set  is  balanced  if  the  expected  activation  value  of  correctly  classified  outputs  is 
independent  of  the  samples  chosen. 

Definition 3 A network $f$ is properly trained if $\forall x \in D_{tst}$ that has been assigned to class $C_i$,
$f(x)_i$ is a monotonically non-increasing function of $d(x, y)$, where $y \in C_i$, $y \in D_{tr}$. Moreover
$\forall z \in D_{tr}$, $d(x, y) < d(x, z)$.

A network is properly trained if the largest activation value of the output for a test pattern
is a non-increasing function of the distance from that pattern to the closest training pattern
of the correct class.

Theorem  1  A  network  properly  trained  on  a  balanced,  consistent  data  set  will  have  higher 
expected activation values for correctly classified patterns than for incorrectly classified patterns.

Proof: Let $x \in D_{tst}$. Further, let $x$ be misclassified in class $C_i$ (target class was $j \neq i$). Let
$y \in D_{tr}$ be the closest pattern in $C_i$ to $x$. Since the data set is consistent, there exists a deleted
$\epsilon$-neighborhood $N^*(y; \epsilon)$ of $y$ such that $N^*(y; \epsilon) \subset C_i$. Since $x \notin C_i$, we also have $x \notin N^*(y; \epsilon)$.
Therefore $d(x, y) > d(z, y)$, $\forall z \in N^*(y; \epsilon)$, and, since $f$ is properly trained, $f(z)_i > f(x)_i$,
$\forall z \in N^*(y; \epsilon) \cap D_{tst}$. So, $E[f(z)_i] > f(x)_i$ where $E[\cdot]$ is computed over $N(y; \epsilon) \cap D_{tst}$. Now,
for all the erroneously classified patterns $x$, such a neighborhood exists. Furthermore, since
the data set is balanced, the expected value of the activation for correct patterns is constant.
Therefore $E[f(z)_i] > f(x)_i$, for all erroneously classified patterns $x$, on any neighborhood
$N(z; \epsilon) \subset D_{tst}$. Therefore $E[f(z)_i] > E[f(x)_i]$, where $z$ and $x$ are correctly and incorrectly
classified patterns respectively. □

Definition 4 Two data sets $D_1$ and $D_2$ are mutually balanced if both are balanced with respect
to $f$ and $E[f(x)_i] = E[f(z)_i]$, for any two arbitrary $N(x; \epsilon) \subset D_1$ and $N(z; \epsilon') \subset D_2$.

It is important to use mutually balanced data sets with combiners in order to avoid
overemphasizing one classifier over another. If the output activations of two classifiers differ greatly,
selecting the output with the largest value, or performing an arithmetic average, will not
necessarily improve the results. The contribution of Theorem 1 is that it predicts the possible
improvements after a simple examination of the classifier outputs. If the average output
activation of correctly classified patterns is higher than the average output activation of incorrectly
classified patterns, then the AVE or MAX combiners are expected to correct errors where at
least one of the classifiers provided the correct response.¹
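
The "simple examination of the classifier outputs" mentioned above amounts to comparing two averages. The following sketch shows one way to compute them; the function name `mean_winning_activation` and the input layout (one output vector and one integer label per test pattern) are assumptions made for illustration.

```python
def mean_winning_activation(outputs, labels):
    """Return (mean over correctly classified, mean over incorrectly classified)
    of the winning output activation. A mean is NaN if its set is empty."""
    correct, wrong = [], []
    for out, label in zip(outputs, labels):
        winner = max(range(len(out)), key=lambda k: out[k])
        (correct if winner == label else wrong).append(out[winner])

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(correct), mean(wrong)


# Hypothetical outputs of one classifier on three test patterns.
m_correct, m_wrong = mean_winning_activation(
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1]
)
```

If `m_correct` exceeds `m_wrong` for both sources being combined, the condition under which the text expects the AVE or MAX combiners to help is met.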

4  Results 

The previous section provided a framework in which combination results can be anticipated.
Furthermore, Theorem 1 provides some insight on when improvements can be expected, namely
when, on average, the activation values for the correctly classified patterns are higher than
the activation values for the incorrectly classified patterns, for both classifier/feature sets that
will be combined.

Since the basis for combining is that correctly classified patterns carry more weight (or more
information) than incorrectly classified patterns, it stands to reason that as long as the correctly
classified patterns are more dependable, combining will improve results. Table 3 provides the
average activation value for the highest outputs (winning classes) of each network and data set
combination for which the correct classification percentages were presented in Table 2. The
averages are computed over correctly and incorrectly classified patterns. For example, the first
row says that when C1 was used on FS1, the average activation value of a correctly classified
pattern was .9723, while the average activation value for an incorrectly classified pattern was
.7811.

¹ If the values are very similar, hypothesis testing may be conducted to estimate how many patterns can be
expected to be corrected.


Table 3: Average Output Activation Values.

                          Correctly Classified Patterns     Incorrectly Classified Patterns
Classifier / Data Set     Activation     Variance           Activation     Variance
C1 - FS1                  .9723          [illegible]        .7811          .0666
C1 - FS2                  .9537          [illegible]        .5901          .1628
C2 - FS1                  .7739          [illegible]        .4786          .0315
C2 - FS2                  .6346          .1432              .3850          .0709


Table 4: Combination Results

                   BEST      AVE       MAX       LIMIT
C1-FS1/C1-FS2      92.66     95.24     93.46     97.21
C2-FS1/C2-FS2      91.30     93.34     91.76     96.96
C1-FS2/C2-FS1      91.30     92.95     92.10     96.84
C1-FS1/C2-FS1      92.66     93.03     93.35     95.02
C1-FS2/C2-FS2      88.60     89.14     88.60     92.71
C1-FS1/C2-FS2      92.66     93.92     93.68     96.96

With four different ways of processing the data, there are six possible pairs of combinations.
Table 4 shows the combination results for each of the six pairs, using the MAX and AVE
combiners. The best result of the two sources (BEST) provides the base to measure the
improvements due to combination. The last column provides a theoretical limit on the improvements,
as obtained by a combiner that corrects all patterns that are correctly classified by at least one
classifier. The combinations that provided statistically significant improvements for at least one
of the combiners are given in the first two rows. The combination in the third row provided
marginal improvements that were at the threshold of statistical significance. An analysis of
Table 3 reveals that for the two combinations that provided statistically significant results, the
activation value for the correctly classified patterns in each source was larger than the
activation value for the incorrectly classified patterns on both sources. For the third combination the
average activation of incorrectly classified patterns in one classifier was statistically comparable
to the average activation of correctly classified patterns in the other classifier (77.39 ~ 78.11,
statistically). For the remaining three combinations, where there were no statistically
significant improvements, the average activation of one classifier on incorrectly classified patterns
was higher than the average activation of the correctly classified patterns on the other classifier.
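
The BEST and LIMIT columns can be computed from per-pattern predictions as sketched below. This is an illustration only: the function name `best_and_limit` and the representation of predictions and labels as integer class indices are assumptions, and the tiny data set in the usage lines is made up.

```python
def best_and_limit(pred_a, pred_b, labels):
    """Return (BEST, LIMIT) as percentages.

    BEST:  accuracy of the better of the two sources on its own.
    LIMIT: accuracy of an ideal combiner that is right whenever at least
           one of the two sources is right.
    """
    n = len(labels)
    acc_a = sum(p == y for p, y in zip(pred_a, labels)) / n
    acc_b = sum(p == y for p, y in zip(pred_b, labels)) / n
    either = sum((a == y) or (b == y) for a, b, y in zip(pred_a, pred_b, labels)) / n
    return 100.0 * max(acc_a, acc_b), 100.0 * either


# Made-up predictions of two sources on four patterns.
best, limit = best_and_limit([0, 1, 0, 0], [0, 0, 2, 0], [0, 1, 2, 3])
```

The gap between the two numbers is the room available to any combiner; a real combiner such as AVE or MAX lands somewhere between them.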


5  Discussion 

A framework predicting combining results was developed and the predictions were tested using
underwater acoustic signals. The results reveal that the framework was successful in
predicting when combining would provide significant improvements. However, the improvements
were short of the limits predicted by the theory. This discrepancy is mainly due to the fact that
the two feature sets are not totally independent, and noise and bias are not the only reasons a
pattern is incorrectly classified. Outliers, for example, are incorrectly classified regardless of the
features extracted to represent them. Furthermore, signal classes are not necessarily disjoint.
Thus, a more general framework that does not assume independence of classifiers and feature sets
needs to be developed. Furthermore, this framework has to handle not only disjoint class
memberships, but also probabilistic classification, where only the likelihood of class membership may
be known.

Abstract - An Adaptive Resonance Theory (ART) neural network algorithm is used to aid the human operator in
the  recognition  and  classification  of  target  images  from  a  synthetic  aperture  radar  (SAR).  The  plasticity-stability 
properties  of  ART  allow  realtime  classification  of  both  familiar  and  unfamiliar  images.  Computer  simulation 
demonstrates  that  the  effectiveness  of  this  approach  in  a  noisy  environment  is  dependent  on  proper  utilization  of 
model  parameters.  The  goal  is  an  intelligent  system  which  is  capable  of  classifying  surface  ship  combatants  while 
autonomously  adapting  in  realtime  to  unexpected  changes  in  the  real  world. 

I. Introduction

Neural networks have been touted as a powerful solution for many pattern recognition applications [1], [6]. The
concept  of  a  "trainable"  computer  holds  a  great  deal  of  potential  for  command  and  control  systems  that  need  to 
dynamically  adjust  to  continuously  changing  and  unforeseen  circumstances.  One  of  the  desired  objectives  for  neural 
networks  is  the  ability  to  aid  the  human  decision  making  process  by  off-loading  some  of  the  information  processing 
that  must  be  accomplished  prior  to  making  a  decision. 

In  combat  systems  today,  there  is  usually  no  shortage  of  information.  In  fact,  there  is  usually  more  raw  data 
than  one  person  can  process  in  a  reasonable  amount  of  time  without  some  additional  analysis  aids.  The  military  is 
continually  searching  for  the  best  decision  aids,  assuming  that  they  are  reliable,  so  that  operators  may  perform  more 
effectively,  especially  in  a  combat  situation.  The  ability  of  the  computer  to  do  certain  tasks,  such  as  spatial  and 
temporal  pattern  recognition,  at  which  the  human  brain  is  adept,  leads  us  to  artificial  neural  networks  which  attempt 
to  assimilate  those  functions  that  the  operator  is  too  busy  to  do. 

In  [3],  Carpenter  and  Grossberg  apply  the  adaptive  resonance  theory  (ART)  to  pattern  recognition  of  binary  data. 
They have also applied ART 2 to pattern recognition of analog data in [2]. Srinivasa and Jouaneh [8] utilize a
combined  invariance  net  and  ART  1  network  to  achieve  pattern  recognition  of  binary  inputs.  In  this  paper,  we  will 
explore  a  particularly  useful  application  of  the  ART  neural  network  to  combat  systems  target  detection,  recognition 
and  classification.  The  input  data  used  for  this  application  are  synthetic  aperture  radar  images.  The  SAR  provides 
unique target signatures that can be converted to binary patterns for classification using ART 1. The goal of this
project is to devise an intelligent system which is capable of classifying surface ship combatants while autonomously
adapting in realtime to unexpected changes in real world scenarios.

The  original  adaptive  resonance  theory  algorithm  developed  by  Carpenter  and  Grossberg  is  a  two  layer  neural 
network  with  interacting  layers  [3].  It  was  designed  for  the  learning  of  recognition  categories.  The  significant  feature 
of  the  ART  network  is  that  it  learns  or  adapts  to  new  inputs  while  at  the  same  time  attempts  to  retain  its  previously 
learned  information  in  some  stable  state.  This  ability  to  learn  while  retaining  old  information  is  known  as  the 
plasticity-stability  dilemma  [1].  Essentially,  the  ART  should  be  able  to  process  and  learn  unfamiliar  events  while 
remembering  familiar  events  without  re-classifying  them.  The  ART  used  for  this  paper  is  basic  ART  1,  which  uses 
binary  data  for  its  inputs.  ART  2,  which  will  be  discussed  later,  is  a  follow  on  to  ART  1  and  allows  for  analog  as 
well as binary input data.

Figure 1. A Typical ART 1 Architecture [2]


The  one  significant  difference  between  an  ART  architecture  and  other  learning  schemes  is  that  it  is  designed 
to  learn  quickly  and  stably  in  realtime  in  response  to  a  changing  world  with  an  unlimited  number  of  inputs  until 
it  runs  out  of  memory.  An  ART  network  can  classify  information  by  adding  neurons  to  its  recognition  layer 
dynamically [3]. New learning does not replace previously learned situations and the vigilance parameter can be
adjusted  to  allow  for  a  realistic  number  of  neurons  to  be  used  in  the  structure  of  the  network. 

II. Application of ART in a Synthetic Aperture Radar Environment

For  the  purposes  of  this  paper,  several  assumptions  and  generalizations  are  made  concerning  the  characteristics 
of  the  synthetic  aperture  radar  [7].  It  is  assumed  that  the  radar  is  capable  of  providing  adequate  resolution  to 
accurately  train  the  network.  The  radar  return  is  approximated  by  a  silhouette  of  some  arbitrary  surface  vessel.  Five 
different  types  of  ships  were  processed  with  significantly  different  silhouettes  for  classification.  To  investigate  the 
effect  of  varying  target  angles,  these  silhouettes  were  also  rotated  in  perspective  views.  In  all,  each  ship  was 
presented  from  6  different  target  angles. 

The  silhouette  of  the  ship  becomes  the  input  to  the  ART.  In  actual  practice,  the  return  would  be  processed  prior 
to  analysis  using  equipment  onboard  the  SAR  platform.  A  16  by  36  grid  is  overlaid  on  the  silhouette  (see  Figure 
2(a)).  This  grid  is  processed  so  that  a  square  is  represented  as  a  binary  one  if  more  than  50%  of  that  square  contains 
some  return  of  the  silhouette.  The  result  is  a  16  by  36  grid  of  ones  and  zeros  which  is,  effectively,  a  binary 
representation  of  the  radar  return  (see  Figure  2(b)). 
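
The grid processing described above can be sketched as follows. This is a minimal illustration, assuming the silhouette is available as a higher-resolution 0/1 image whose dimensions are exact multiples of the grid size; the function name `binarize_to_grid` and the toy image are hypothetical.

```python
import numpy as np

def binarize_to_grid(silhouette, rows=16, cols=36, threshold=0.5):
    """Reduce a high-resolution 0/1 silhouette to a rows x cols binary grid.

    A grid square becomes 1 when more than `threshold` of its pixels
    belong to the silhouette.
    """
    silhouette = np.asarray(silhouette, dtype=float)
    height, width = silhouette.shape
    if height % rows or width % cols:
        raise ValueError("image size must be a multiple of the grid size")
    bh, bw = height // rows, width // cols
    # Split the image into (rows, bh, cols, bw) blocks and average each block.
    coverage = silhouette.reshape(rows, bh, cols, bw).mean(axis=(1, 3))
    return (coverage > threshold).astype(np.uint8)

# Hypothetical 32 x 72 image: each grid square covers a 2 x 2 pixel block.
image = np.zeros((32, 72))
image[0:2, 0:2] = 1      # square (0, 0) fully covered -> 1
image[0:2, 2:3] = 1      # square (0, 1) exactly 50% covered -> 0
image[2:4, 0:2] = 1
image[2, 0] = 0          # square (1, 0) is 75% covered -> 1
grid = binarize_to_grid(image)
```

Note that a square that is exactly half covered stays at zero, since the text requires more than 50% coverage.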

The  grid  is  assumed  to  be  sized  to  the  target  in  order  to  maximize  the  target  usage  of  the  grid  from  several 
different  target  angles.  The  grid  can  be  sized  based  on  the  range  to  the  target  and  the  target's  approximate  size  based 
on  the  return.  The  inherent  assumption  is  that  the  SAR  has  a  fine  enough  resolution  return  to  accurately  represent 
the  same  class  of  ship  from  different  target  angles  over  a  wide  range  of  distances  to  the  target.  Each  of  these  target 
angles  would  be  classified  differently  according  to  the  ART;  however,  they  are  correlated  at  the  output  by  some 
post-processing mechanism to give the operator the target classification, distance and approximate target angle.
Since each neuron classifies an individual target angle (six per ship), a simple logical OR of the six applicable
neurons for a specific ship would give immediate classification.
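
The OR-based post-processing can be sketched as below. The sketch assumes, purely for illustration, that the recognition neurons are ordered ship by ship with six consecutive neurons per ship; in practice the mapping from neurons to ships would have to be recorded during training. The ship names are placeholders.

```python
import numpy as np

N_ANGLES = 6
SHIPS = ["ship_A", "ship_B", "ship_C", "ship_D", "ship_E"]  # placeholder names

def classify_ship(neuron_outputs):
    """Return (ship, angle index) for the active neuron, or (None, None)."""
    outputs = np.asarray(neuron_outputs, dtype=bool).reshape(len(SHIPS), N_ANGLES)
    ship_active = outputs.any(axis=1)          # logical OR across the six angles
    if not ship_active.any():
        return None, None
    ship_index = int(np.argmax(ship_active))
    angle_index = int(np.argmax(outputs[ship_index]))
    return SHIPS[ship_index], angle_index

outputs = np.zeros(30, dtype=int)
outputs[2 * N_ANGLES + 4] = 1   # neuron for the third ship, fifth target angle
result = classify_ship(outputs)
```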


(a)  Ship  with  grid  overlay 


(b)  Simulated,  processed  radar  return  from  target 


Figure  2.  Grid  Overlay  for  Target  Data 


The  grid  sizes  are  proportional  to  the  distance  to  the  target.  Obviously  as  the  target  gets  farther  away,  it  becomes 
increasingly  difficult  to  achieve  accurate  granularity  on  the  grids.  One  solution  to  this  problem  is  to  change  the 
element  size,  such  as  combine  four  grid  elements  into  one  outside  of  certain  ranges  and  then  reclassify  based  on 
this  new  data.  The  problem  is  that  outside  a  certain  range,  all  targets  start  to  look  the  same  and  classification  may 
not be attainable. This is a problem inherent to the nature of the SAR itself [7].

In  summary,  the  ART  network  will  receive  the  processed  radar  data  in  the  form  of  a  two  dimensional  array  of 
ones and zeros. The grid provides the ART network with a binary array which the network converts to a column
vector  for  analysis. 

There  could  be  additional  pre-processing  done  of  the  raw  SAR  data,  such  as  classification  of  targets  by  size. 
This would be a simple way to eliminate two targets of dissimilar sizes with similar superstructures. For example,
separate  ART  1  networks  could  be  set  up  for  ships  between  certain  lengths  so  that  a  single  ART  1  would  not  have 
to  be  particularly  large  in  terms  of  neurons  to  cover  all  ships. 

In the processed radar return, there may appear to be a considerable amount of white space around the target.
This  white  space  is  to  allow  for  different  target  angles  of  the  same  ship  and  for  the  high  noise  area  around  the  hull 
of  the  ship  where  sea  return  is  a  major  factor  in  signal  loss  and  scattering.  Significant  pre-processing  of  the  raw 
radar  data  should  give  a  clean  grid  similar  to  the  one  in  Figure  2. 

III. ART 1 Network Design and Implementation

The ART 1 network was implemented using MATLAB 4.0 [5]. A flowchart of the basic algorithm for this
problem  is  illustrated  in  Figure  3. 


Figure 3. ART 1 Implementation Flowchart

Architecture Implementation


The  comparison  layer  is  composed  of  a  hardlimiter  with  three  inputs:  the  input  vector,  the  results  of  layer  two 
weighted  by  the  layer  1  weight  vector,  and  a  bias  (or  gain)  vector.  The  hardlimiter  evaluates  the  sum  of  these  three 
vectors to produce a vector of ones and zeros. The bias vector is initialized to zeros, but switches to a value of -1.5
for  the  second  run  through  this  layer  to  check  the  validity  of  the  winning  neuron. 

The recognition layer is composed of a competition algorithm with two inputs: the results from layer one
weighted by the layer 2 weight vector and a bias vector. This bias vector is used to eliminate neurons that have
been disqualified by the vigilance parameter. If a neuron is disqualified, its element in the bias vector will be set
to -100, thus effectively removing it from further competition. The Instar training algorithm [5] is used to adjust
both weight vectors of the winning neuron.
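
The competition with a disqualifying bias can be sketched as follows. This is an illustration only, not the MATLAB implementation: the weight values are made up, and the winner is assumed to be the neuron with the largest net input.

```python
import numpy as np

DISQUALIFY = -100.0  # bias value quoted in the text

def compete(layer1_output, w2, bias):
    """Return the index of the winning recognition neuron.

    w2 has one row per recognition neuron; bias holds 0 for eligible
    neurons and DISQUALIFY for neurons rejected by the vigilance test.
    """
    net = w2 @ np.asarray(layer1_output, dtype=float) + bias
    return int(np.argmax(net))

w2 = np.array([[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.4, 0.4, 0.0],
               [0.0, 0.0, 0.3, 0.3]])
x = np.array([1, 1, 1, 0])
bias = np.zeros(3)

first = compete(x, w2, bias)      # neuron 0 has the largest net input
bias[first] = DISQUALIFY          # suppose neuron 0 fails the vigilance test
second = compete(x, w2, bias)     # the competition is rerun without neuron 0
```

Because the net inputs of binary patterns are small compared with 100, a disqualified neuron cannot win again during the current presentation.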

The  vigilance  parameter  is  set  by  the  user,  and  it  reflects  the  result  of  a  bitwise  AND  operation  between  the 
input  vector  and  the  pattern  of  the  chosen  neuron.  The  elements  of  the  resulting  vector  are  summed  and  divided 
by  the  sum  of  the  input  vector  elements.  In  this  way,  the  vigilance  parameter  measures  the  percentage  of  ones  in 
the  proper  places. 
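
The match computation described above reduces to a few lines. The following sketch uses hypothetical function names and assumes the input vector contains at least one nonzero element (all-zero inputs are screened out during preprocessing).

```python
import numpy as np

def match_ratio(input_vec, template):
    """Fraction of the ones in the input that also appear in the template."""
    x = np.asarray(input_vec, dtype=bool)
    w = np.asarray(template, dtype=bool)
    return np.logical_and(x, w).sum() / x.sum()

def passes_vigilance(input_vec, template, vigilance):
    """True when the match ratio meets the user-set vigilance."""
    return match_ratio(input_vec, template) >= vigilance

x = [1, 1, 0, 1, 0, 1]
w = [1, 0, 0, 1, 0, 1]
ratio = match_ratio(x, w)   # 3 of the 4 ones in x are matched
```

With these vectors the pattern would be accepted at a vigilance of 0.7 but rejected at 0.8.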


Preprocessing 

A grid of size 16 x 36 is superimposed on the radar image. The grid squares are then marked "1" if there is
a radar return, "0" otherwise. The result is a 16 x 36 matrix of "1"s and "0"s representing a rough image of
the target. The size of this grid is crucial and will be discussed later.

The  rows  of  this  matrix  are  concatenated  to  form  one  long  vector.  This  vector  is  then  checked  to  determine 
if it is all zeros (i.e., no target present). In this event, it is designated as no target, and the algorithm terminates.
Otherwise,  the  vector  is  entered  into  the  network  to  be  classified. 

Parameters 

There  are  two  crucial  user  defined  parameters  in  the  network:  the  vigilance  and  the  maximum  number  of 
neurons.  These  two  values  are  interrelated  and  are  most  easily  determined  by  a  bit  of  trial  and  error.  Once  set,  they 
both  become  network  constants. 

The vigilance parameter determines the coarseness of the classifications [3]. In this problem, the vigilance
parameter can range from 0 to 1, with the higher values representing more strict classifications. Typical values seem
to range from 0.6 to 0.99 for this problem.

The  maximum  number  of  neurons  also  affects  the  classification  of  the  inputs.  In  the  world  of  computer 
simulation,  the  number  of  neurons  can  be  unlimited,  but  in  the  real  world  factors  such  as  cost  and  availability  will 
limit the maximum number [4]. By limiting the number of neurons, we are setting a limit on how many different
groups  of  classifications  we  are  allowing.  This  number  must  be  within  the  tolerances  determined  by  the  vigilance 
parameter. 

IV.  Results 

The key to successful results was the proper choice of the vigilance parameter. To begin with, the vigilance
parameter must be chosen to provide proper discernibility in the classifications. This tends to lead to a fairly high
value,  in  our  case  from  approximately  0.8  to  0.99.  These  settings  are  too  high  in  actual  practice  due  to  noise  in  the 
radar  return  which  creates  new  target  classifications  based  on  the  added  noise.  One  solution  is  to  alter  the  vigilance 
parameter  once  the  network  has  been  satisfactorily  "trained".  By  lowering  the  value,  the  network  becomes  more 
tolerant  of  noise  in  the  input.  The  effect  of  varying  the  vigilance  parameter  is  illustrated  in  Figure  4.  The  network 
was  initially  trained  at  a  vigilance  of  0.95,  and  then  the  images  were  corrupted  by  noise  and  tested  at  various 
settings  for  the  vigilance.  It  is  clear  that  the  optimum  value  for  operation  is  closer  to  0.6,  and  the  performance 
deteriorates rapidly around 0.85.


Figure 4. Testing Vigilance Results (plot of correct IDs vs. testing vigilance)


Figure 5. Ship Classification Results (plot titled "Ship Classifications")


A very low value (0.10) for the vigilance parameter was chosen for a second set of test cases to illustrate the
effect of coarse classification. The images tend to be classified into a few, broad categories. Testing with this value
for the vigilance yields consistent results due to the broad range of the resultant groupings.

Figure 5 points out that the success rate of the classifications is dependent on the ship type. In general pattern
recognition, this means that the percentage of successful classifications of a pattern is dependent on the pattern itself.
Using our silhouettes, the amphibious ship fared the worst while the battleship and the carrier seemed to do the best.
This can be attributed in part to the fact that the carrier and battleship silhouettes contain more distinguishing
features than the other vessels.

In  practice,  a  network  is  also  limited  by  the  maximum  number  of  neurons  available.  In  implementation,  the 
number  of  neurons  is  proportional  to  the  amount  of  "memory"  needed.  In  our  test  runs,  if  the  number  of  neurons 
chosen  is  too  low,  the  network  is  unable  to  classify  all  of  the  images.  This  is  because  the  network  does  not  modify 
the  weights  of  an  existing  neuron  if  the  result  is  out  of  tolerance  (set  by  the  vigilance  parameter). 

Thus,  in  field  applications,  the  maximum  number  of  neurons  and  the  vigilance  parameter  are  crucial  and  difficult 
quantities  to  determine.  In  practice,  they  need  to  be  set  by  a  qualified  technician  and  considered  fixed  to  the  user 
in  the  field.  Any  changes  in  these  values  could  lead  to  "instability"  in  the  network. 

It should be noted that once the network has stabilized to a given set of neurons, it will precisely classify the
original  set  of  training  vectors.  This  is  because  the  network  will  not  alter  the  weight  vectors  if  the  result  is  a  perfect 
match. 


V. Conclusions

ART 1 networks are well suited for target classification and recognition. The ability to learn and classify targets
proved to work well for this application. The SAR target information was classified into separate categories when
the vigilance parameter was set to a proper value.

The  difficulty  in  developing  the  system  lies  primarily  in  the  implementation  and  the  choice  of  the  network 
parameters.  The  vigilance  parameter  tends  to  be  highly  sensitive  and  difficult  to  determine  other  than  by  trial  and 
error. In addition, varying levels of noise would require different degrees of classifications and, thus, different
values  for  the  vigilance  parameter.  A  possible  solution  is  to  present  the  network  a  library  of  known  images  at  a 
high  vigilance  parameter  and  then  send  it  out  to  the  field  with  a  lower  value  for  the  noisy  environment. 

A  limiting  factor  in  the  implementation  is  the  maximum  number  of  neurons  available.  If  the  network  cannot 
place  an  image  within  the  required  tolerance,  it  will  need  to  utilize  another  neuron.  If  no  neuron  is  available,  the 
network  will  fail  to  classify  the  image  rather  than  corrupting  a  previous  category. 
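
The behaviour under a limited neuron budget can be illustrated with a simplified categorizer. This sketch is not the authors' MATLAB code: it ranks committed neurons by raw overlap with the input (a simplification of the ART 1 choice function), uses fast learning (template AND input), and assumes non-zero binary inputs. The class name `SimpleART1` is hypothetical.

```python
import numpy as np

class SimpleART1:
    """Simplified ART 1-style categorizer with a fixed neuron budget."""

    def __init__(self, vigilance, max_neurons):
        self.vigilance = vigilance
        self.max_neurons = max_neurons
        self.templates = []            # one binary template per committed neuron

    def present(self, pattern, learn=True):
        x = np.asarray(pattern, dtype=bool)
        # Rank committed neurons by overlap with the input (largest first).
        order = sorted(range(len(self.templates)),
                       key=lambda j: -np.sum(self.templates[j] & x))
        for j in order:
            match = np.sum(self.templates[j] & x) / np.sum(x)
            if match >= self.vigilance:
                if learn:
                    self.templates[j] = self.templates[j] & x   # fast learning
                return j
        if learn and len(self.templates) < self.max_neurons:
            self.templates.append(x.copy())
            return len(self.templates) - 1
        return None                    # no neuron available: left unclassified

net = SimpleART1(vigilance=0.9, max_neurons=2)
a = net.present([1, 1, 1, 0, 0, 0])   # commits neuron 0
b = net.present([0, 0, 0, 1, 1, 1])   # commits neuron 1
c = net.present([1, 0, 1, 0, 1, 0])   # matches neither and no neuron is free
d = net.present([1, 1, 1, 0, 0, 0])   # the first category is still intact
```

The third pattern is left unclassified rather than overwriting an existing template, so the first pattern is still recognized afterwards.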

There are two major areas that can be further explored: the redefinition of the vigilance parameter and the
adaptation of the problem to an ART 2 model.

The vigilance parameter can be modified to take advantage of certain a priori knowledge in the working
environment. Rather than using a single value, the vigilance parameter can consist of a vector of values. This
would  allow  the  operator  to  utilize  different  criteria  for  different  sections  of  the  image.  Specifically,  the  operator 
can  make  use  of  the  fact  that  for  a  surface  target,  the  top  1/3  of  the  radar  return  will  consist  of  the  superstructure, 
the  middle  1/3  would  emphasize  the  hull,  and  the  bottom  1/3  would  be  dominated  by  noise  from  sea  return.  The 
vigilance  parameters  could  be  varied  to  emphasize  the  general  hull  outline  and  superstructure  characteristics  while 
deemphasizing  the  sea  clutter. 
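
One way such a vigilance vector might operate is sketched below. This is a speculative illustration of the proposal, not a tested design: the grid is split into horizontal bands, a match ratio is computed per band, and each band is compared against its own vigilance value. The function names and the small example grid are hypothetical.

```python
import numpy as np

def regional_match(input_grid, template_grid, n_bands=3):
    """Match ratio computed separately for horizontal bands of the grid."""
    x = np.asarray(input_grid, dtype=bool)
    w = np.asarray(template_grid, dtype=bool)
    ratios = []
    for x_band, w_band in zip(np.array_split(x, n_bands, axis=0),
                              np.array_split(w, n_bands, axis=0)):
        ones = x_band.sum()
        # A band with no ones in the input cannot fail the test.
        ratios.append(1.0 if ones == 0 else (x_band & w_band).sum() / ones)
    return np.array(ratios)

def passes_vigilance_vector(input_grid, template_grid, vigilance_vec):
    """True when every band meets its own vigilance value."""
    ratios = regional_match(input_grid, template_grid, len(vigilance_vec))
    return bool(np.all(ratios >= np.asarray(vigilance_vec)))

# Toy 6 x 4 grid: the template fills the top and middle bands only.
template = np.zeros((6, 4), dtype=int)
template[0:4, :] = 1
noisy = template.copy()
noisy[4:6, :] = 1                      # simulated sea clutter in the bottom band

ratios = regional_match(noisy, template)
ok_vector = passes_vigilance_vector(noisy, template, [0.9, 0.8, 0.0])
ok_uniform = passes_vigilance_vector(noisy, template, [0.9, 0.9, 0.9])
```

Setting the bottom-band vigilance to zero lets the clutter pass, whereas a uniform vigilance rejects the same image.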

To take advantage of the varying signal return strength, this problem is ideally suited to application of an ART
2 network. An ART 2 network is capable of handling continuous analog input as well as binary data [2]. The
gray-scale levels inherent in a SAR image presentation could be used to further classify the image, essentially
adding a third dimension of knowledge to the classification. The different gray-scale levels can be attributed to
different relative ranges, as well as different target material [7]. Unfortunately, the price for this capability is
increased complexity [4]. The typical ART 2 network processes a discrete number of values over a certain finite
range and requires a great deal of pre-processing of the data [6] before it is accepted by the network. Furthermore,
the analog data presents additional noise problems that would necessitate a significant amount of signal processing.

Abstract

Error Correcting Adaptive Resonance Theory (ECART) networks represent differences between
inputs and templates explicitly. This class of ART networks generalizes the attentional vigilance
parameter of other ART networks into an attentional vigilance vector. The attentional vigilance vector
allows independent control of the vigilance associated with each input element. Raising the vigilance of
an input element corresponds to lowering the acceptable categorical error tolerance associated with that
element. Four sets of rules are required to define the function of an ECART network: (1) a code
selection rule, (2) a resonance or reset rule, (3) template learning rules, and (4) initial conditions. Different
sets of rules can be chosen to make the ECART network functionally equivalent to a number of different
classifiers, including ART 1, ART 2a, Fuzzy ART, and the supervised forms of each of these. The focus
of this paper is on the ability of ECART networks to learn not only the category representatives but
also the variability associated with each category. From a statistical point of view, learning category
prototypes and their variability is similar to estimating means and variances. Learning both categorical
prototypes and their variability provides robust recognition of patterns in noisy environments.

1.  Introduction 

The basic principles of Adaptive Resonance Theory (ART) were introduced by Grossberg (1976). A
number of ART networks have been introduced and studied since then (Carpenter and Grossberg,
1987a, 1987b; Carpenter, Grossberg, and Rosen, 1991; Carpenter, Grossberg, and Reynolds, 1991;
Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1991). Baxter (1991a, 1991b) introduced
a class of adaptive resonance networks which explicitly compute errors between input vectors and
learned templates. Such networks are referred to as ECART (Error Correction Adaptive Resonance
Theory) networks. ECART networks can operate as unsupervised or supervised pattern recognition
machines. Unsupervised ECART networks represent coding errors, that is, errors between input vectors and
learned feature templates. Supervised ECART networks represent both coding errors and predictive
errors. In these networks predictive errors encode differences between learned outcomes and actual
outcomes, and, in the absence of coding errors, determine the appropriate number of coding cells to use
to represent the training set. One additional flexibility of ECART networks over other networks that
comes naturally from explicit representation of errors is the generalization of the concept of attentional
vigilance. ECART networks allow the formation of a set of attentional vigilance parameters, referred to
as the attentional vigilance vector. The attentional vigilance vector allows independent control of the
vigilance associated with each input element. Raising the vigilance of an input element corresponds to
lowering the acceptable categorical error tolerance associated with that element.

In statistical pattern classification problems, categories can be represented by a number of
parameters. The most commonly-used parameter is the category prototype, e.g. the mean. A number of
classifiers only use a prototypical representation of categories. Categories can also be represented by
their boundaries. It is often necessary to represent categories by more than one type of representation
in order to capture key informational aspects. Consider the problem of classifying exemplars generated
by two different Gaussian distributions. Only two categories and two parameters for each category
(the mean and variance) are necessary to represent and classify the exemplars generated by the two
distributions. If these two parameters are estimated correctly for each category, the Bayesian optimum
classification performance can be obtained. ART networks represent categories by prototype (as in
ART 1 and ART 2) or by boundaries (as in Fuzzy ART). In this paper, unsupervised and supervised
ECART networks that represent categories by both prototypes and boundaries are described.

2.  ECART  network  function  and  architecture 

Figure  1  depicts  the  flow  of  signals  in  an  ECART  network.  The  major  components  of  an  ECART 
network  are  an  input  field,  a  competitive  coding  field,  a  coding  error  field,  a  predictive  error  field,  and 
a  novelty  detector.  ECART  networks  can  be  implemented  as  massively  parallel  networks  for  extremely 
fast  processing. 

Four sets of rules are required to define an ECART network: (1) a code selection rule, (2) a resonance
or reset rule, (3) learning rules, and (4) initial conditions (initial template values). Many variations
of ECART networks are possible by choosing different sets of rules; two unsupervised variations
are described in the following paragraphs.

Bidirectional ECART

The first variation has feature templates with elements that can increase and decrease in value; this
variation will be referred to as Bidirectional ECART. Let the scaled input pattern be denoted by $x$ and
its elements by $x_i$ (the $x_i$ are scaled such that their values lie on the closed interval $[0, 1]$), let
adaptive feature template $j$ be denoted by $w_j$, and let the elements of the adaptive feature templates
be denoted by $w_{ji}$. Feature codes of the feature templates are indexed by $j$. The selected code for a
given input pattern is denoted by $J$ and is the code with the maximum $s_j$ over all codes $j \in R$, where

$$s_j = -\sum_i \left( e_i^+ + e_i^- \right)^p \qquad (1)$$

and where $e_i = x_i - w_{ji}$, $e_i^+ = [e_i]^+$, and $e_i^- = [-e_i]^+$. The notation $[e_i]^+$ indicates the
rectification function $\max(e_i, 0)$. The value of $p$ is either 1 or 2; an $L_1$ norm is obtained with
$p = 1$, whereas an $L_2$ (Euclidean) norm is obtained with $p = 2$. $R$ is the set of all
previously-learned codes that have not been reset for the current input pattern.
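
The code selection rule can be sketched numerically as follows. The sketch assumes that $s_j$ is the negated sum of the $p$-th powers of the rectified coding errors, so that the maximum $s_j$ corresponds to the template closest to the input; the function names and the example templates are hypothetical.

```python
import numpy as np

def coding_errors(x, w):
    """Return the rectified errors (e_plus, e_minus) for input x and template w."""
    e = np.asarray(x, dtype=float) - np.asarray(w, dtype=float)
    return np.maximum(e, 0.0), np.maximum(-e, 0.0)

def select_code(x, templates, eligible, p=2):
    """Return (J, scores): the eligible code with the maximum s_j."""
    scores = {}
    for j in eligible:
        e_plus, e_minus = coding_errors(x, templates[j])
        # Assumed form of s_j: negated summed p-th power of the errors.
        scores[j] = -np.sum((e_plus + e_minus) ** p)
    best = max(scores, key=scores.get)
    return best, scores

templates = np.array([[0.2, 0.2],
                      [0.8, 0.6]])
J, scores = select_code([0.7, 0.5], templates, eligible={0, 1}, p=2)
J_after_reset, _ = select_code([0.7, 0.5], templates, eligible={0}, p=2)
```

Removing a code from the eligible set, as happens after a reset, forces the selection of the next best template.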

The  novelty  detector  determines  whether  an  input  pattern  is  novel  or  familiar.  If  a  pattern  is 
sufficiently  familiar,  the  network  is  said  to  be  in  the  resonant  state  and  the  most  similar  feature  template 
is  adapted  to  better  match  the  current  input  pattern.  The  resonance  criterion  is  given  by 

$$r = \sum_i \left[ e_i^+ + e_i^- - E_i \right]^+ = 0 \qquad (2)$$

where the $E_i$ are the allowable error tolerances. Since $x_i$ and $w_{ji}$ lie in the closed interval
$[0, 1]$, the $E_i$ are related to the elements of the vigilance vector by $\rho_i = 1 - E_i$. A small value
of $E_i$ corresponds to a high vigilance for element $i$. If all $E_i = 0$, every unique input pattern will
create a unique template. If $r > 0$, then the novelty detector resets the selected code ($J$ is removed
from $R$) and the search for a better code ensues. If no existing templates are sufficiently familiar, then
a new template is created. The discrete-time form of the learning rule is given by

$$w_{Ji}(t+1) = w_{Ji}(t) + \lambda_J e_i(t) \qquad (3)$$

where $\lambda_j$ is the learning rate associated with code $j$ and is restricted to positive values. In
Bidirectional ECART, the initial values of the elements of $w_j$ are set to zero.

Figure 1: ECART networks consist of an input field, a competitive coding
field, a coding error field, a predictive error field, and a novelty detector. The
black triangles represent long-term memory weights.

The value of $\lambda_j$ affects the stability of the feature codes. Bidirectional ECART is equivalent to the
leader algorithm if $\lambda_j$ is initially set to unity and is set to zero after the first update of $w_j$. A more
general rule is to initially set $\lambda_j$ to unity, and decrease $\lambda_j$ by a decay factor each time code $j$ is
selected. If the decay factor does not change, this rule results in an exponential decay in the learning rate as a
function of the number of times the code is chosen. Note that the templates in a Bidirectional ECART
represent category prototypes. If $\lambda_j$ is a small constant and learning involves many passes through the
training set, the prototypes will approach the category means.
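
A complete presentation cycle of Bidirectional ECART (selection, resonance test, reset, learning, and learning-rate decay) might be sketched as follows. The sketch makes several assumptions: code selection uses $p = 1$, the resonance test sums the rectified excess of the absolute error over the tolerances, and a new template is created as a copy of the input, which is what a zero-initialized template updated once with a unit learning rate would yield. The function name `ecart_step` and the numbers are hypothetical.

```python
import numpy as np

def ecart_step(x, templates, rates, tolerances, decay=0.5):
    """Present one input; return the index of the code that learned it.

    templates and rates are Python lists that are modified in place.
    """
    x = np.asarray(x, dtype=float)
    eligible = set(range(len(templates)))      # the set R of codes not yet reset
    while eligible:
        # Code selection with p = 1: the smallest summed |error| wins.
        J = min(eligible, key=lambda j: np.sum(np.abs(x - templates[j])))
        e = x - templates[J]
        r = np.sum(np.maximum(np.abs(e) - tolerances, 0.0))
        if r == 0:                             # resonance: adapt the template
            templates[J] = templates[J] + rates[J] * e
            rates[J] *= decay                  # rate decays each time J is used
            return J
        eligible.discard(J)                    # reset: remove J and search again
    # No familiar template: a zero template updated with unit rate equals x.
    templates.append(x.copy())
    rates.append(1.0 * decay)
    return len(templates) - 1

templates, rates = [], []
E = np.array([0.2, 0.2])                       # error tolerances E_i
inputs = ([0.2, 0.8], [0.3, 0.7], [0.9, 0.1])
codes = [ecart_step(x, templates, rates, E) for x in inputs]
```

The second input falls within tolerance of the first template and pulls it halfway toward itself; the third input is novel and creates a second template.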


Unidirectional  ECART 

The templates of the Bidirectional ECART network can oscillate; therefore, the feature codes may not
be stable for some sets of input patterns unless $\lambda_j$ decays sufficiently fast during training. Several ways
of making $\lambda_j$ decay during training were discussed in the previous paragraph. An alternative method of
stabilizing the templates is to restrict the direction in which the templates can change. This variation
of ECART is referred to as Unidirectional ECART. Unidirectional ECART requires two sets of weights,
which will be denoted by $w_j^+$ and $w_j^-$, with initial values 0 and 1, respectively. The coding errors in
Unidirectional ECART are given by

$$e_i^+ = \left[ x_i - w_{ji}^+ \right]^+ \qquad (4)$$

$$e_i^- = \left[ w_{ji}^- - x_i \right]^+ \qquad (5)$$

and the learning rules are given by

$$w_{Ji}^+(t+1) = w_{Ji}^+(t) + \lambda_J e_i^+(t) \qquad (6)$$

$$w_{Ji}^-(t+1) = w_{Ji}^-(t) - \lambda_J e_i^-(t) \qquad (7)$$

The  code  selection  and  resonance  criteria  remain  unchanged. 

In  Unidirectional  ECART  the  templates  represent  categorical  boundaries  and  can  be  described 
geometrically  as  hyper-rectangles.  If  an  input  vector  falls  within  one  of  the  existing  hyper-rectangles,  it 
belongs  to  that  cluster  and  no  learning  takes  place.  If  an  input  vector  does  not  lie  within  an  existing 
hyper-rectangle,  it  will  cause  the  most  similar  hyper-rectangle  (defined  by  sj)  to  expand  to  include  that 
vector  -  provided  the  resonance  criterion  is  satisfied.  If  the  resonance  criterion  is  not  satisfied,  a  new 
cluster  is  created. 
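
The hyper-rectangle behaviour can be seen in a short sketch. It assumes that the upper-bound weights start at 0 and grow when an input exceeds them, that the lower-bound weights start at 1 and shrink when an input falls below them, and that the learning rate is 1; the resonance test is omitted for brevity. The function name is hypothetical.

```python
import numpy as np

def unidirectional_update(x, w_plus, w_minus, rate=1.0):
    """Expand the hyper-rectangle [w_minus, w_plus] toward the input x."""
    x = np.asarray(x, dtype=float)
    e_plus = np.maximum(x - w_plus, 0.0)     # input above the upper bound
    e_minus = np.maximum(w_minus - x, 0.0)   # input below the lower bound
    return w_plus + rate * e_plus, w_minus - rate * e_minus

w_plus, w_minus = np.zeros(2), np.ones(2)    # initial values 0 and 1
for x in ([0.4, 0.5], [0.6, 0.3], [0.5, 0.4]):
    w_plus, w_minus = unidirectional_update(x, w_plus, w_minus)
```

The first input collapses the rectangle to a point, the second expands it, and the third lies inside it and causes no change, so the bounds can only move outward.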

Different  code  selection  and  resonance  criteria  are  possible.  For  example,  with  the  code  selection 
function 


min(x,-,  u>t)  +  maxjxj,  wjj) 


(8) 


and  the  resonance  criterion 


r 


min(*,-,  wfi)  +  maxixi.wj-i) 


>  P 


(9) 


the  Unidirectional  ECART  network  is  functionally  equivalent  to  Fuzzy  ART  (Carpenter,  Grossberg, 
and  Rosen,  1991). 


3.  Learning  prototypes  and  error  bounds 

The Bidirectional and Unidirectional ECART networks can be combined to form a network that represents
category error bounds as well as prototypes. The simplest way of combining the two networks is
simply  to  let  them  operate  independently  in  parallel  and  use  the  templates  of  the  Bidirectional  ECART 
network  as  the  prototypes  and  those  of  the  Unidirectional  ECART  network  as  the  error  bounds.  The 
problem  with  this  method  is  that  the  two  networks  may  not  generate  the  same  number  of  categories; 
thus,  a  one-to-one  correspondence  between  prototypes  and  error  bounds  is  not  guaranteed  (and  is 
unlikely  in  many  cases). 

A preferable method is to use three different types of coding errors and templates with the following
learning rules


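$w_{ji}(t+1) = w_{ji}(t) + \lambda_j e_i(t)$  (10)

$w_{ji}^{+}(t+1) = w_{ji}^{+}(t) + \mu_j e_i^{+}(t)$  (11)

$w_{ji}^{-}(t+1) = w_{ji}^{-}(t) - \mu_j e_i^{-}(t)$  (12)

where

$e_i = x_i - w_{ji}$  (13)

$e_i^{+} = (x_i - w_{ji}^{+})^{+}$  (14)

$e_i^{-} = (w_{ji}^{-} - x_i)^{+}$  (15)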

The  code  selection  and  resonance  criteria  are  the  same  as  before.  If  /ij  =  this  method  guarantees 
that the prototype templates $w_{ji}$ are in one-to-one correspondence with the error bound templates $w_{ji}^{\pm}$.
In this method, the code selection and resonance criteria depend exclusively on the category boundaries
($w_{ji}^{\pm}$).

Alternatively,  the  category  prototypes  can  be  used  to  select  the  most  appropriate  codes  and  to 
generate  reset  signals.  More  generally,  the  code  selection  and  resonance  criteria  can  make  use  of  some 
combination  of  the  category  prototypes  and  boundaries  in  order  to  select  the  most  appropriate  code. 
For example, the code selection criterion can be generalized to determine the code with the maximum

+ ‘tY  -  i«'I"  (18) 

i  i 

and,  similarly,  the  resonance  criterion  can  be  generalized  to  the  form 


r  =  E  [“  (‘f  +  «r )  +  =  0 


(17) 


Here, $\alpha$ and $\beta$ determine how much emphasis is placed on categorical prototypes and boundaries,
respectively,  in  the  code  selection  and  resonance  processes. 


4.  Summary 

Several  methods  of  extending  ECART  networks  to  learn  categorical  prototypes  and  error  bounds  have 
been  discussed.  The  extensions  are  combinations  of  Bidirectional  and  Unidirectional  ECART  networks. 
The  category  prototypes  may  have  chaotic  trajectories  but,  with  appropriate  restrictions,  they  can  be 
guaranteed  to  be  bounded  from  above  and  below  by  the  categorical  boundaries.  These  techniques  can 
be  applied  to  both  unsupervised  and  supervised  ECART  networks. 

ABSTRACT

When a set of raw image data is preprocessed properly, the training of an artificial-perceptron
pattern recognizer may be achieved in a noniterative manner. The noniterative training is not only
very fast (e.g., 2 seconds for training 4 hand-written characters), but may also be very robust
(e.g., the recognition can be rotation-invariant, size-invariant, and location-invariant even
though the perceptron is trained with only unrotated standard patterns). The high robustness of
this noniteratively trained perceptron is due to the optimum noniterative training scheme and the
Fourier-transform preprocessing scheme we adopted in our design. This paper reports the
theoretical origin and the experimental results of this novel perceptron pattern recognition system.
An unedited video movie of the whole training/recognition process was recorded in real time for
demonstration purposes.

Key  words:  Pattern  Recognition,  Non-Conventional  Unsupervised  Learning 

I. INTRODUCTION

Preprocessing of raw image data is generally a very important part of pattern recognition schemes
employing artificial perceptrons. Preprocessing must be neither too detailed nor too coarse in order to
achieve high robustness in recognition while still preserving the capability of
distinguishing different training patterns. Fourier descriptors have been used in preprocessing the
raw data by many pattern-recognition researchers [1,2]. Once a proper preprocessing
scheme is selected, the training of the perceptron and the recognition of the untrained patterns can
generally be achieved very efficiently by applying a noniterative training scheme to a one-layered,
hard-limited structure [3,4,5]. The theory and the experiments of this novel compound
perceptron learning system are reported in detail in the following.

II. FOURIER TRANSFORMS USED IN PREPROCESSING
Global Properties vs. Local Properties

When we look at the two characters shown in Fig. 1, we can recognize them immediately as A and
B. The reason that our brains can immediately recognize these characters is NOT that we have
inspected carefully all the detailed small variations such as the crooked parts in A or the fuzzy
lines in B. Instead, we have picked up the general structures or the coarse variations of the
images, which allow us to make the immediate decisions. These general structures are NOT
LOCAL properties of the images. They are GLOBAL properties of the images. That is, if some
crooked parts of A are changed or some fuzzy lines of B are changed, etc., locally, this makes
quite a difference in the image. But, as a whole, it does not affect our brain's decisions at all.
Consequently, in the design of an artificial character recognition scheme which we wish to have a
robustness approaching as closely as possible that of our biological recognition system, two
factors must be taken into account:

I. It must be able to automatically extract global properties of the images.
II. It must be able to filter out the small variations in the images.

Fourier  transforms  of  the  image  functions  with  high  (spatial)  frequency  components  truncated  off 
seem  to  be  able  to  meet  both  of  these  conditions.  This  is  so  because  each  Fourier  component  is 
calculated  from  the  whole  image  function,  not  from  a  local  part  of  the  image.  Therefore  it  is  a 
global  property,  not  a  local  property.  Also  if  we  truncate  off  all  high  frequency  Fourier 
components, we are filtering out all small spatial variations in the images.

In addition to satisfying these conditions, if we separate the r-function from the $\theta$-function in the
Fourier analysis as shown below, we may also obtain rotation-invariant robustness in the
recognition.

Fourier  Transforms  in  a  Polar  Coordinate 

Suppose we have a digitized image in the form of black and white (blank) pixels and the x, y
coordinates of each (black) pixel are known. Then we can find the x, y coordinates of the
centroid of the black pixels. If we use this centroid point as the center, we can draw a polar
coordinate system with maximum radius equal to the maximum r of all the pixels (Fig. 2). Each dot in
Fig. 2 represents the center of a black pixel. Counting the number of dots in each ring will
give us a radial vector $V_r$, and counting that in each sector will give us an angular vector $V_\theta$. If
there are J rings and K sectors in the polar coordinate system, then these two vectors are, respectively,
J-dimensional and K-dimensional analog vectors with integer components. These vectors can also be
represented by quantized analog scalar functions $f_1(r)$ and $g(\theta)$ with integer quantization levels
such as the ones shown in Fig. 3. If we normalize each level in $f_1$ with respect to the area of each
ring, normalize r with respect to $r_{max}$, and call the new function f(r), then we can apply the
following Hankel and Fourier transformations to f and g, respectively, to get the spatial
frequency components $F_m$ and $G_n$ of these two functions. (The Hankel transform is equivalent to the
Fourier transform of a circularly symmetric function along the radial direction of the polar
coordinate system [6].)

$F_m = 2\pi \int_0^1 f(r) J_0(2\pi m r)\, dr$  (1)

where $J_0$ is the zeroth-order Bessel function of the first kind.

$G_n = \left| \int_0^{2\pi} g(\theta)\, e^{-i n \theta}\, d\theta \right|$  (2)


where $|\cdot|$ means taking the magnitude of the complex number enclosed inside. Notice that the
phase angle in the Fourier transform in (2) is ignored.

The physical meanings of these Fourier components are the following. $F_m$ represents the
variation of the radial structure of the whole image in the "detail range m" (or frequency range
m). Similarly, $G_n$ represents the variation of the angular structure of the whole image in the
detail range n. With higher frequency components truncated off, these $F_m$ and $G_n$ then constitute
an analog word, and this analog word is the input word we use for training the perceptron. Notice
that when the image is rotated, this analog word will not change at all. (The rotation invariance
of $F_m$ is obvious because f(r) does not depend on $\theta$. The rotation invariance of $G_n$ is due to the
fact that if $\theta$ in (2) is replaced by $\theta + \theta_0$, where $\theta_0$ is any rotation angle, the magnitude of the
complex integration will NOT depend on the initial phase $\theta_0$.)

Because of the quantized nature of the f and g functions shown in Fig. 3, the actual computation
of  the  integrals  is  converted  to  finite,  weighted  sums  of  the  quantization  data.  These  finite  sums 
are  called  the  segmented  Fourier  transforms  which  are  very  easy  and  very  fast  to  handle  in 
digital  computations.  These  transforms  are  very  similar  to  the  conventionally  used  FFT. 
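
The preprocessing described in this section can be illustrated with a short Python sketch. It is a simplified stand-in rather than the segmented transforms themselves: it counts dots per ring and per sector about the centroid, normalizes the ring counts by relative ring area, and takes the magnitudes of the low-order terms of an ordinary discrete Fourier transform of the sector counts. The function name `polar_features`, the bin counts, and the random test pixels are all assumptions made for illustration.

```python
import numpy as np

def polar_features(xy, n_rings=8, n_sectors=16, n_keep=4):
    """Ring/sector dot counts about the centroid and low-order |DFT| of the sector counts."""
    d = xy - xy.mean(axis=0)                      # coordinates relative to the centroid
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2.0 * np.pi)
    ring_counts, _ = np.histogram(r / r.max(), bins=n_rings, range=(0.0, 1.0))
    sector_counts, _ = np.histogram(theta, bins=n_sectors, range=(0.0, 2.0 * np.pi))
    # Normalize each ring count by the relative ring area, which grows as 2j + 1.
    f = ring_counts / (2.0 * np.arange(n_rings) + 1.0)
    # Magnitudes of the lowest-frequency DFT terms; the phase is discarded.
    G = np.abs(np.fft.fft(sector_counts))[:n_keep]
    return f, G

rng = np.random.default_rng(0)
pixels = rng.normal(size=(200, 2)) * [3.0, 1.0]   # hypothetical black-pixel coordinates
angle = 2.0 * np.pi * 3 / 16                      # rotate by exactly three sectors
rot = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
f0, G0 = polar_features(pixels)
f1, G1 = polar_features(pixels @ rot.T)
```

Rotating by a whole number of sectors circularly shifts the sector counts, so the magnitudes `G0` and `G1` agree and the ring profile is unchanged. For arbitrary rotation angles the agreement is only approximate, because dots near a sector boundary may change bins.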

III. THE NONITERATIVELY TRAINED HARD-LIMITED PERCEPTRON


The perceptron we use for training and recognition is very similar to the ones we reported
previously [3, 4, 5]. We also use the optimal one-step learning scheme we reported in [4] such
that the training vectors are located with equal noise distances to any dichotomization plane in the
N-space. But instead of applying this optimum scheme to a supervised learning scheme, we apply
it to an unsupervised learning scheme to eliminate the illegality problems we occasionally
encountered in a supervised learning system. The general theory of this one-step learning scheme
is briefly reviewed here.

If the input-output mapping vectors to be learned are {$U_m \to V_m$, m=1 to M}, where $U_m$ is an N-dim
analog vector and $V_m$ is a P-bit binary vector, then the goal of learning is to find the
connection matrix $a_{ij}$ in the following simultaneous nonlinear algebraic equations.

$v_{mi} = \mathrm{sgn}\left( \sum_{j=1}^{N} a_{ij} u_{mj} \right)$,  i=1 to P, m=1 to M  (3)

where $u_{mj}$, $v_{mi}$ are components of $U_m$ and $V_m$. For a supervised learning problem, $U_m$, $V_m$ are
given. If we multiply both sides of (3) by $v_{mi}$, which is either 1 or -1, (3) is then changed to

$\sum_{j=1}^{N} a_{ij} v_{mi} u_{mj} > 0$,  i=1 to P, m=1 to M  (4)

(The property that $v_{mi}^2 > 0$ is used in the above derivation.) Since j is the only
running index in (4), (4) can be re-written in the form of the inner product between two vectors:


$A_i \cdot Y_{mi} > 0$,  m=1 to M, i fixed.  (5)

where $A_i$ is an N-dim vector representing the i-th row of the $a_{ij}$ matrix, and $Y_{mi} = v_{mi} U_m$ is the m-th
input vector dichotomized according to the i-th output bit in the m-th mapping pair. Whenever the
{$U_m \to V_m$} mapping is given, the $Y_{mi}$ are given, and $A_i$ can be solved from the M simultaneous
inequalities shown in (5). The solution may or may not exist. The if-and-only-if condition that a
solution exists in (5) is, by Farkas' lemma, that

all {$Y_{mi}$, i fixed, m=1 to M} are positively linearly independent (or PLI).  (6)

PLI means that no non-negative real constants {$p_m$, m=1 to M}, except all 0's, satisfy the
following linear relation among all $Y_{mi}$'s

$\sum_{m=1}^{M} p_m Y_{mi} = 0$,  i fixed  (7)

For a supervised learning problem, the {$Y_{mi}$} are fixed when the {$U_m \to V_m$} mapping is given. (6) may
be violated and the solution for $a_{ij}$ may not exist at all. Such a mapping is called an illegal mapping
for supervised learning in the one-layered perceptron (OLP). That means, no matter what
learning rule we use to train the OLP, we cannot realize the given mapping at all if (6) is violated in a
supervised learning scheme. On the other hand, if each training $U_m$ represents a different
pattern to be recognized, then we can use the unsupervised approach to assign each $V_m$ in such
a way that all the M $U_m \to V_m$ mapping pairs become legal. Therefore solutions of (5) definitely
exist. The algorithm for solving (5) is then trivial and the solution can be obtained in one
matrix-operation step. This unsupervised one-step learning scheme is what we adopted in our
present design.
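
To make the idea of a one-step solution concrete, the following Python sketch solves the inequalities (5) with a single pseudoinverse. This is an assumed stand-in, not the optimal equal-noise-distance scheme of [4]: it relies on the training vectors being linearly independent (which implies PLI for any sign assignment), and the dimensions, the random vectors, and the 2-bit codes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, P = 32, 4, 2                         # input channels, patterns, output bits
U = rng.normal(size=(N, M))                # columns are the training vectors U_m
V = np.array([[1, 1, -1, -1],              # row i holds output bit i for each pattern
              [1, -1, 1, -1]], dtype=float)

# One matrix operation: with linearly independent columns, pinv(U) @ U = I,
# so A @ U reproduces V and every inequality in (5) holds.
A = V @ np.linalg.pinv(U)
recalled = np.sign(A @ U)
```

Each pattern receives a distinct output code, and the hard-limited outputs `recalled` match the assigned codes without any iterative training.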


IV.  EXPERIMENTAL  RESULTS 

We use a mouse in Microsoft Visual Basic (loaded on a 486 PC) to write any free-hand letters
such as the ones shown in Fig. 4. A program written in Visual Basic then digitizes the image and
stores the x, y positions of each quantized pixel in a data file. Recalling the data file and applying it to
an event-driven, interactive program we designed according to the above discussion then allows
us to preprocess the data as well as to train the neural network or to test the image. We use a
one-layered, hard-limited perceptron with 32 channels of analog input. The output is a 2-bit
binary vector. We used regularly hand-written letters such as the four letters shown in Fig. 4-1
for training. Then we can test any free-hand writings such as the ones shown in Fig. 4-2. With
a success rate of more than 80%, these free-hand writings are recognized correctly in the test.
Notice that the test patterns include script letters and rotated letters which are not included in the
training set.


V.  CONCLUSION 

Applying a special Fourier-transform preprocessing scheme to unsupervised learning in a
one-layered, hard-limited perceptron allows us to obtain high robustness when the perceptron is
switched to the recognition mode. The training of the perceptron is very fast (4 patterns in 2
seconds) because the training is noniterative and one-step. The recognition of the perceptron is
very robust because we apply an optimum training scheme to the preprocessed global properties
of the image. An unedited video movie was recorded for a real-time demonstration of the training
and the recognition experiments.

ABSTRACT


A  modular  recurrent  connectionist  architecture  is  proposed  to  classify  binary  and  continuous 
patterns.  This  system  consists  of  three  networks:  one  feed-forward  Back-Propagation  (BP) 
network  and  two  Self-Organization  Map  (SOM)  networks.  The  feed-forward  (basic)  network 
is trained until a saturation error level occurs. Simultaneously, the first SOM (input control)
network and the last SOM (output control) network define the mapping features for the given
input/output patterns. The resultant features are used by a Gaussian potential function to
adjust the weights of the basic network and to classify the given patterns.

Summary


A connectionist system can be evaluated in terms of its capability to represent the knowledge
(patterns) in terms of an input-output mapping function. An accurate representation of this
mapping depends on many issues including the system architecture, the type of the
connections, the number of neurons in the hidden units, the type of activation functions, the
learning algorithm, and the input set representation. These issues are important for finding the
optimal parameters of the network to represent the given knowledge. In this research, the
connectionist system architecture and the dynamic weights are the main issues to be explored
and examined.

Although many psychologists and biologists [1] believe that the brain has a modular
architecture, only a few researchers have applied the idea of modularity to connectionist systems.
There are several advantages of the modular architecture over a single-network architecture
[2,3], including a higher learning speed, pattern representations that are easier to interpret,
debug, and extend, a reduced hardware implementation, and better generalization.

Generalization is a measure of how well the network performs on the actual patterns
once the training is complete. Making the weights dynamic is the most important factor in
improving the generalization. Several studies have worked toward making the
weights dynamic. These include Grossberg’s shunting model [4] and Nowlan and Hinton’s
model for soft weight sharing [5]. In addition to these, some studies combined
the concept of dynamic weights with modular architecture, including Pollack’s cascaded
back-propagation model [6] and Jacobs’s mixed expert network [2].

In our research, we are developing novel modular architectures based on the dynamics of the
connections. Previously, a modular feed-forward architecture [7,8] based on a back-propagation
(basic) network and a self-organization mapping (control) network was proposed and examined
(Figure-1). The basic network uses the Back-propagation learning algorithm [9] and has only one
hidden layer (i.e. s=3) with (k) hidden units. This network updates the weights $w_{ij}(t)$ during
the learning phase by the Back-propagation rule.

(Figure-1)  A  Modular  Feed-Forward  Architecture 

At the same time, the control network (Self-Organization Map [10]) finds a mapping from
the input pattern space $x \in \Re^n$ onto a regular two-dimensional array (n*k). With every node i
in this array, a parametric reference vector $m_i \in \Re^n$ is associated. An input pattern is
compared with all the $m_i$, and the node with the smallest Euclidean distance $\|x - m_i\|$ is taken as the
best-matching node, which is signified by c as:


$\|x - m_c\| = \min_i \{ \|x - m_i\| \}$


Note that every input pattern x is mapped onto the node c relative to the parameters $m_i$.
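
The best-matching-node search can be written directly from the definition. The sketch below is illustrative only; the function name `best_matching_node` and the four reference vectors are hypothetical, and the array of reference vectors is stored as one row per node.

```python
import numpy as np

def best_matching_node(x, m):
    """Return the index c of the reference vector m[c] closest to x (Euclidean)."""
    distances = np.linalg.norm(m - x, axis=1)
    return int(np.argmin(distances))

# Hypothetical reference vectors, one row per node.
m = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = best_matching_node(np.array([0.9, 0.2]), m)
```

Here the input (0.9, 0.2) is closest to the second reference vector, so `c` is 1.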

The feed-forward modular architecture [7] uses the best-matching node in the two-dimensional
array (i.e., c) to update the weights of the outputs of the basic network by $\Delta w$,
which is given by:

It also updates the weights in the hidden layer of the basic network by $\Delta w_{ij}$, which is given
by:

A  *Xi  *exp  { ixj  -lUi  I ) 


Where  x  is  a  normalization  factor. 


In this paper, a modification has been made to the above architecture to give it the flexibility
to also find the mapping features in the desired outputs, as shown in Figure-2. The input control
network finds a mapping from the input pattern space $x \in \Re^n$ onto a regular two-dimensional
array (n*k).


(Figure-2)  A  Modular  Recurrent  Architecture 

As shown in Figure-2, another Self-Organization Map (SOM) network (the output control) is
added to find a mapping from the desired output pattern space $y \in \Re^d$ onto an array (q*d),
where d is the number of the desired outputs. While the input control network finds the
best-matching nodes in the input patterns, the output control network tries to find the
best-matching nodes in the output patterns, which can be represented as:

The learning phase of this architecture has two steps. In the first step, the basic network
updates the weights until the error reaches a saturation level (0.0001 of error versus the
epoch). The input and the output control networks find the best-matching nodes for both
the input patterns and the desired outputs. In the second step, the control networks update
the weights of the outputs of the basic network using a Gaussian potential function by $\Delta w_{ij}$,
which is given by:

*yi  *  exp 


Where  x  is  an  output  normalization  factor  and  a  is  the  standard  deviation  for  the  desired 
outputs. 

At  the  same  time  ,  it  updates  the  weights  in  the  hidden  layer  of  the  basic  network  by 
which  is  given  by: 

Aw^j=r\  *x^*exp  { 

Where  t)  is  an  input  normalization  factor  and  xp  is  the  standard  deviation  for  the  given  input 
patterns. 

This modular recurrent architecture and its learning algorithm were tested on three examples: (1)
the Exclusive-OR problem, (2) the IRIS plant data, and (3) 3-partition nonlinear functions (e.g.,
logarithmic, exponential, multiply). Table-1 shows a comparison between three architectures:
the modular feed-forward architecture, the modular recurrent architecture, and Jacobs’s
mixed expert networks model.


Problem                Architecture                      No. of Epochs   RMS Output Error %

XOR Problem            Mixed Expert Networks (Jacobs)    1267            8%
                       Modular Feed-Forward Arch.        843             3.5%
                       Modular Recurrent Arch.           904             2.3%

IRIS Data              Mixed Expert Networks (Jacobs)    4389            14.1%
                       Modular Feed-Forward Arch.        2493            12.6%
                       Modular Recurrent Arch.           1783            9.8%

3-Partition Function   Mixed Expert Networks (Jacobs)    659             2.1%
                       Modular Feed-Forward Arch.        672             3.2%
                       Modular Recurrent Arch.           865             1.9%

Conclusion


In this research, we have designed a connectionist system by combining dynamic weights
with the modular recurrent architecture to classify binary and continuous patterns. This
system includes the dynamic weights as multiplicative Gaussian-function connections, which
allow it to determine the mapping from the given patterns to the output patterns and to switch
between different input spaces. Currently, we are still testing more nonlinearly dependent
patterns in different signal processing applications. At the same time, further mathematical
analysis is being carried out to prove some theories.

Abstract

The use of a multilayer perceptron (MLP) neural network (NN) as a human chromosome classifier was
studied. The MLP NN classifier was optimized for the chromosome data with respect to learning rate,
momentum constant and training cycle. The MLP classifier learning curves were examined by measuring
the probability of correct test set classification for an increasing number of training examples. Only 10-20
examples were required for the MLP NN classifier to reach its ultimate performance regarding the number
of features used. To compare the results to relevant theory, we have calculated the entropic error (loss).
The empirical dependence of the entropic error on the number of examples is highly comparable to the
1/t function that is a universal learning curve.

1.  Introduction 

Human chromosome abnormalities are responsible for about 50% of early fetal losses, 5% of late fetal losses and
20% of birth defects [16]. It is no wonder that karyotyping, the procedure of chromosome analysis, is a
cornerstone of prenatal diagnosis. The Canadian Workload Measurement System [3] allocates 465
minutes for karyotyping amniocytes, the most common diagnostic activity in cytogenetics. Most of the
time is dedicated to microscopy, a tedious, eye-straining task requiring meticulous attention to detail.
It therefore needs highly qualified, and hence well-paid, personnel. As of today, the analysis of chromosomes
is the limiting factor in the wide application of cytogenetics as a diagnostic tool. The commercially
available computerized systems for chromosome sorting are of great help but still inadequate. The
systems are definitely inferior to the human performer. First and most important, these are expensive,
non-automatic devices that need human assistance throughout the process.

Neural networks make it possible to overcome most of these limitations. This is mainly because they
permit application of expert knowledge and experience through network training. The neural network
classifier has the advantage of being fast (highly parallel), easily trainable and capable of creating
arbitrary partitions of the feature space. Multilayer perceptron neural networks have been used in several
studies of biological object classification. In a study evaluating the growth of tumors in mice, an MLP
neural network trained by the backpropagation learning algorithm [14] was able to distinguish among
seven stages in tumor growth [4]. In another investigation [15], an MLP was trained to classify cervical cell
images as either normal or abnormal. The classifier correctly classified 96% of the cell images in the test
set. In a similar study, an MLP NN was used to classify cells for cancer diagnosis with a probability of correct
classification of about 96% [12]. An attempt was made to train an MLP NN to define and detect DNA-binding sites
[13], with 80% probability of correct detection. The only known effort to classify human
chromosome images using NN, besides the work of our group [5]-[10], is described in [2]. This reference
is an abstract only. The classification was made using the Fourier coefficients of the medial axis and an
MLP NN and yielded 92.5% probability of correct classification for the test set.

An effort has been made over the last two years to utilize neural networks as a human chromosome
classifier [5]-[10]. This effort has been mainly concentrated on the feature extraction and selection issues.
This research, on the other hand, has focused on the optimization of an MLP NN as a human
chromosome classifier of 5 chromosome types. In addition, the learning curves of the MLP classifier
were empirically investigated and compared to the theory.

2. The MLP classifier

In this research, a two-layer feedforward neural network trained by the backpropagation (bp) learning
algorithm [14] was chosen for the classification. The bp algorithm is an error-driven parameter estimation
algorithm where the objective is to minimize the output squared error function by adjusting
interconnection weights and node thresholds. Learning is controlled by the values selected for the
learning rate and the momentum constant. No rule for selecting the optimal values of these parameters
exists and usually they are chosen empirically according to the training data. The number of hidden
layers and the number of hidden units in each hidden layer affect the shape of the decision regions of the
classifier, and therefore affect the classification performance and complexity. In this study, a two-layer
perceptron was considered. The network was initialized using random weights in the [-1,1] range. The
input vector was 64-dimensional and the output vector was 5-dimensional with one component set to "1"
(actually 0.9) for the correct classification and "0" (actually 0.1) elsewhere [9]. The number of hidden
units of the network was set according to Principal Component Analysis (PCA) applied to the feature
vectors. The number was set to be the number of the largest eigenvalues, the sum of which accounts for
more than a pre-specified percentage of the sum of all the eigenvalues [9]. In all the simulations, this
number was set according to a pre-specified percentage of 90%.
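
The eigenvalue rule for choosing the number of hidden units can be sketched as follows. This is a minimal illustration under assumptions: the eigenvalues are taken from the sample covariance matrix of the feature vectors, and the function name `n_hidden_from_pca` and the synthetic feature data are hypothetical, not the chromosome features used in this study.

```python
import numpy as np

def n_hidden_from_pca(features, fraction=0.90):
    """Smallest number of leading eigenvalues whose sum exceeds `fraction` of the total."""
    cov = np.cov(features, rowvar=False)                  # rows are feature vectors
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]      # largest first
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Index of the first cumulative share strictly greater than `fraction`, plus one.
    return int(np.searchsorted(cumulative, fraction, side="right") + 1)

rng = np.random.default_rng(2)
# Synthetic 6-dimensional features with two dominant directions of variance.
features = rng.normal(size=(500, 6)) * [10.0, 5.0, 1.0, 0.1, 0.1, 0.1]
k = n_hidden_from_pca(features)
```

In this synthetic case the first eigenvalue alone carries roughly 80% of the variance and the first two carry about 99%, so the rule selects two hidden units at the 90% threshold.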

3. Learning curves

Learning curves show how fast the behavior of a machine improves as the number of training
examples increases. There are several approaches to this problem, e.g., the statistical-mechanical
approach, the information-theoretic approach and the statistical approach. All of these approaches suggest
that the average error decreases universally in the order of 1/t, where t is the number of training
examples. A universal property, that irrespective of the machine architecture the average entropic loss
decreases asymptotically as d/t, where d is the number of modifiable parameters of the classifier, has
been proved [1]. The average entropic loss is the average of the negative logarithm of the probability of correct
classification for the next new pattern after a classifier has been trained by t training examples. Moreover,
when the classifier error tends to zero (or the probability of correct classification tends to 1) the average
entropic loss and the generalization error are almost identical [1].

In  this  study,  we  measured  the  probability  of  correct  test  set  classification  while  the  number  of 
training  examples  increased.  The  maximum  number  of  examples  was  set  by  the  minimum  number  of 
training  vectors  over  all  classes  (chromosome  types).  The  experiment  was  repeated  with  a  different 
number of features selected by the "knock-out" algorithm. The "knock-out" algorithm is a feature
selection  method,  where  the  best  features  among  the  extracted  features  are  selected  using  the 
effectiveness  (scattering)  criterion  of  "minimum  variance"  [17]. 

In addition, the entropic error (loss) $e^{*}(t)$ [1],

$$e^{*}(t) = -\log(P_{test}) \qquad (1)$$

was calculated and compared to the theoretical curve ($P_{test}$ is the probability of correct test set
classification).
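
A minimal sketch of this calculation is given below. It assumes the natural logarithm; the function names are hypothetical, and the reference curve simply encodes the d/t law quoted above.

```python
import math

def entropic_error(p_test):
    """Entropic error e*(t) = -log(P_test); natural log is an assumption."""
    if not 0.0 < p_test <= 1.0:
        raise ValueError("p_test must lie in (0, 1]")
    return -math.log(p_test)

def universal_curve(d, t):
    """Theoretical reference curve d/t (d modifiable parameters, t examples)."""
    return d / t
```

For example, a probability of correct test set classification of 0.98 gives an entropic error of about 0.02, which can then be plotted against the number of training examples next to the reference curve.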

4.  Results 

4.1 The MLP NN optimization

A description of the methodology and the features we have used appears elsewhere [5]-[9]. Three
parameters of the classifier, namely the training cycle (in units of epochs), the momentum constant (a) and
the learning rate (p), were checked in order to find the best network. All the simulations were repeated (at
least) 3 times, with the same network parameters but with different sets of randomly chosen training
vectors, and the results were averaged. The probability of correct training and test set classification is
plotted, in Figure 1, against the training cycle (epochs). Training is performed in batch mode, which means
that the network weights are changed only after each presentation of all the vectors to the network (epoch).
We can see that the ultimate learning is obtained within the first 500-1000 epochs (and with a Sum Square
Error (SSE) of less than 4). However, the training cycle in all the simulations was kept at 4000 epochs.
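
To make the batch-mode update concrete, the following sketch applies one weight change per epoch, with a momentum term, to a toy linear least-squares problem. It is not the MLP used in the study: the linear model, the function name and the standard momentum form (new step = momentum times previous step minus learning rate times gradient) are assumptions chosen for brevity.

```python
import numpy as np

def train_batch_momentum(X, T, lr=0.026, momentum=0.97, epochs=4000):
    """Batch gradient descent with momentum on 0.5 * SSE of a linear model."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    w = np.zeros(X.shape[1])
    step = np.zeros_like(w)
    for _ in range(epochs):
        error = X @ w - T          # errors for all training vectors
        grad = X.T @ error         # gradient accumulated over the whole batch
        step = momentum * step - lr * grad
        w = w + step               # weights change once per epoch
    return w
```

The default arguments mirror the parameter values reported in this section; on a different error surface other values may be needed for stable convergence.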


Figure  1.  The  probability  of  correct  test  classification. 

The sensitivity of the classification procedure to the momentum constant (a) and to the learning rate
(p) is shown in Figure 2 and Figure 3, respectively. The best generalization was obtained when the
momentum constant and the learning rate were equal to 0.97 and 0.026, respectively. Therefore, all the
simulations were run with these 3 parameter values: a training cycle of 4000 epochs, a = 0.97 and
p = 0.026.

Using these parameters, the MLP classifier was almost perfectly (99.3-99.6%) trained to classify
chromosomes of 5 types and yielded a probability of correct test classification of over 98% [7], [9].

Figure 2. The correct test set probability vs. the momentum constant.

Figure 3. The correct test set probability vs. the learning rate, for the optimal momentum constant
obtained in Figure 2.


4.2  Learning  curves 

The probability of correct test set classification was measured as the number of training examples
increased. The maximum number of examples was 84, which is the smallest number of training vectors in
any one of the chromosome classes. First, the MLP network was trained using only one example for each
chromosome type and the probability of correct test set classification was calculated. Then, another
example for each chromosome type was added to the training set and the new probability of correct test
set classification was calculated. The procedure continued until all available examples (84) were used.
The experiment was repeated 3 times for a different number of selected features, namely 10, 20 and 60
features. In each case, the features were the "best" features we could select according to the "knock-out"
algorithm [17]. The results are shown in Figure 4. Only 10-20 examples are required for the MLP NN
classifier to reach its ultimate performance, regardless of the number of features used. The entropic error
(loss) has been calculated in order to compare the results to the theory outlined before. The dependence of
the entropic error on the number of examples is shown, for the best 60 features, in Figure 5. The results are
very closely approximated by the 1/t function, which is a universal learning curve [1].

Figure 4. The probability of correct test set classification vs. the number of training examples for 3
different values of selected features ("o" for 10 features, "x" for 20 features and a third marker for 60
features).



Figure 5. The entropic error (loss) vs. the number of training examples, for the best 60 features ("x"),
compared to a universal learning curve in the order of 1/t (solid line).


5. Discussion

The multilayer perceptron (MLP) neural network (NN) was used to classify human chromosomes. The
NN classifier was optimized for the chromosome data with respect to training cycle, learning rate and
momentum constant. On the basis of this optimization, the MLP NN classifier was almost perfectly
(99.3-99.6%) trained to classify chromosomes of 5 types and yielded a probability of correct
test classification of over 98% [7].

The MLP classifier learning curves were investigated by calculating the probability of correct
test set classification as the number of training examples was increased. Only 10-20 examples were
required for the MLP NN classifier to reach its ultimate performance, regardless of the number of features
used. To compare the results to a relevant theory, we have calculated the entropic error (loss). The
dependence of the entropic error on the number of examples is highly comparable to the 1/t function,
which is a universal learning curve [1].

Abstract

A neuro-muscular signal, during muscle contraction, is composed of multiple trains of randomly
distributed (Poisson process) action potentials from a number of motor units innervated
from the spinal cord. In order to study the motor control strategy and to diagnose various
neuro-muscular diseases, many investigators have used techniques of pattern recognition to decompose
the muscle signal into its constituent action potential trains. In this work, a Hopfield network
model is used, for its real-time chip fabrication possibility, to solve this inverse problem of the
neuro-muscular system by decomposing the muscle signal. A muscle signal is simulated which is composed
of five different patterns of action potential trains. The shapes of the simulated action potentials
are derived from typical real muscle signals. Signal templates, one from each class of action
potentials, are converted into their bipolar bitmap patterns and are used to compute the network
weights and biases. Each detected action potential in the signal is then converted to its equivalent
bitmap pattern and presented to the Hopfield network for iterative stabilization to
its closest attractor in the network state space. The system achieves very good recognition,
with a 94% detection rate even when the entire signal is corrupted with 20% noise, representing a normal
clinical measurement environment. The result is very promising for real-time implementation.
By applying this method to surface muscle signals, there is good potential to develop multiple
spike train-based biological control algorithms for prostheses.


1  Introduction 

A  neuro-muscular  signal,  also  called  myoelectric  (ME)  or  electromyographic  (EMG)  signal,  arises 
from  the  depolarization  of  muscle  fibers  following  the  discharge  of  the  innervating  motor  neuron. 
The  depolarization  wave,  called  an  action  potential  (AP),  propagates  along  the  muscle  fibers.  In 
normal skeletal muscle, the fibers never contract as individuals. Instead, small groups of them
contract  in  concert.  Such  a  group  of  muscle  fibers  is  supplied  by  the  terminal  branches  of  the  nerve 
fiber  or  axon  whose  cell  body  is  in  the  anterior  horn  of  the  spinal  grey  matter.  This  nerve  cell 
body,  plus  the  long  axon,  plus  its  terminal  branches  and  all  the  muscle  fibers  together  constitute  a 
motor  unit  (MU).  The  action  potential  of  a  MU  is  called  a  motor  unit  action  potential  (MUAP). 

In a sustained muscle contraction, the motor units must be repeatedly activated. The resulting
sequence of MUAPs is called a motor unit action potential train (MUAPT). The waveform
of the MUAPs within a MUAPT will remain constant provided that the geometric relationship
between the electrode and the active muscle fibers, the properties of the recording electrode, and
the biochemical properties of the muscle tissue all remain constant.

Figure 1: A Schematic of a Neuromuscular System

The muscle fibers belonging to the same motor unit are distributed randomly within the muscle
rather than being clustered together, and can be spread throughout a territory occupying as much
as 30% of the muscle's cross-sectional area. As a result, neighboring muscle fibers generally belong
to different motor units, and in any small region there can be fibers from as many as 50 motor
units. Therefore, a single MUAPT is observed when the fibers of only one motor unit in the vicinity
of the electrode are active. Such a situation occurs only during a very weak muscle contraction.
As the force output of a muscle increases, motor units having fibers in the vicinity of the electrode
become activated, and several MUAPTs will be detected simultaneously.

One of the research goals in this field is to decompose, or inverse process, the muscle
signal into its constituent MUAPTs in order to study the motor control strategy and to improve
the diagnostic accuracy of the clinical EMG examination. Many investigators have used modern
techniques of pattern recognition to decompose the muscle signal into its constituent MUAPTs. In
this work, a Hopfield network model is used, for its real-time chip fabrication possibility, to solve
this inverse problem of the neuromuscular system by decomposing the muscle signal, and an idea is
proposed for developing a prosthesis or robot controller based on the decomposed multiple impulse trains.

2  Neuromuscular  System  Model 

A schematic of a neuromuscular system model is shown in Figure 1. The Dirac delta impulses
$\delta_n(t)$ are fired by the alpha-motoneurones, travel down the axons and then stimulate the
muscle fibers. Let us assume that there are $L_n$ muscle fibers connected with the $n$-th motor
unit, and let $M_n$ be the total number of firings in a contraction time $T_c$ fired by the $n$-th motor unit.

The minimum level of system configuration is a muscle fiber connected to a nerve branch. The
system input is an impulse and the output of this system is an impulse response, which is called the
single fiber action potential (SFAP). Thus the input train of impulses can be represented as:

$$\delta_n(t) = \sum_{m=1}^{M_n} \delta(t - t_m) \qquad (1)$$

where

$$t_m = \sum_{j=1}^{m} x_j, \qquad m, j = 1, 2, 3, \ldots, M_n \qquad (2)$$

In the above expressions, $t$ is the continuous time variable and $t_m$ represents the time locations of the
impulses, accumulated from the intervals $x_j$. Therefore, the equation for the impulse response train of a
single muscle fiber (the $l$-th) is given by:

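$$u_{nl}(t) = \sum_{m=1}^{M_n} h(t - t_{mn}) \qquad (3)$$

where $h$ is the impulse response of the muscle fiber and $t_{mn}$ denotes the firing time $t_m$ of the
$n$-th motor unit. An impulse response train, also called a MUAPT,
from all the muscle fibers belonging to the motor unit is then obtained by the spatial summation
over all the muscle fibers innervated by the motor unit:

$$u_n(t) = \sum_{l=1}^{L_n} \sum_{m=1}^{M_n} h(t - t_{mn}) \qquad (4)$$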

Again, the EMG signal as detected by the electrode is the spatial summation of the MUAPTs
from all $N$ motor units, resulting in a multi-train EMG signal that can be represented as:

$$emg(t) = \sum_{n=1}^{N} u_n(t) = \sum_{n=1}^{N} \sum_{l=1}^{L_n} \sum_{m=1}^{M_n} h(t - t_{mn}) \qquad (5)$$

The signal given by the above expression is the physiological EMG signal and is not observable.
When the signal is detected, an electrical measurement noise $w(t)$ is introduced from various sources.
The detected signal will also be affected by the spatio-temporal convolution by the nerve-muscle
fibers belonging to a number of MUs. Thus the resulting observable EMG signal can be expressed
as:

$$EMG(t) = \sum_{n=1}^{N} s_n(t) * u_n(t) + w(t) \qquad (6)$$

where $s_n(t)$ is the point spreading function in space and time, sparse in the discrete index $n$.
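
The signal model above can be illustrated with a short simulation. The sketch below is not the simulator used in this work and rests on several assumptions: each motor unit is represented by a single lumped MUAP template (the sum over its fibers), the point spreading function is taken as an identity, firings follow a Poisson process with exponential intervals, and the noise is Gaussian with a standard deviation equal to a fraction of the mean template amplitude. All names are hypothetical.

```python
import numpy as np

def simulate_emg(templates, rates_hz, duration_s, fs_hz, noise_fraction, rng):
    """Return (observed EMG, per-unit MUAPTs) for a toy multi-train signal."""
    n_samples = int(duration_s * fs_hz)
    trains = np.zeros((len(templates), n_samples))
    for n, (h, rate) in enumerate(zip(templates, rates_hz)):
        # Firing times are cumulative sums of the inter-pulse intervals.
        intervals = rng.exponential(1.0 / rate, size=int(3 * rate * duration_s) + 10)
        firing_times = np.cumsum(intervals)
        idx = (firing_times[firing_times < duration_s] * fs_hz).astype(int)
        impulses = np.zeros(n_samples)
        np.add.at(impulses, idx, 1.0)                     # impulse train
        trains[n] = np.convolve(impulses, h)[:n_samples]  # MUAPT of unit n
    clean = trains.sum(axis=0)                            # sum over motor units
    mean_amp = np.mean([np.max(np.abs(h)) for h in templates])
    noise = noise_fraction * mean_amp * rng.standard_normal(n_samples)
    return clean + noise, trains
```

Calling it with five templates and `noise_fraction=0.2` would give a signal in the spirit of the 20% noise case discussed later, although the exact noise definition used in the experiments is not specified here.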

In a continuously force-varying contraction the parameters $M_n$ and $N$ are directly force dependent.
Therefore, the final EMG signal will be a function of both time $t$ and muscle force $F$.
This  EMG  signal  can  be  detected  by  intramuscular  or  by  surface  electrodes.  In  the  clinical  EMG 
laboratory,  the  conventional  bipolar  and  monopolar  needle  electrodes  are  usually  used. 

3  Brief  History  of  EMG  Decomposition  Problem 

The problem of decomposing any multi-train signal is an important issue for the pattern
recognition and signal processing community. It is important because by decomposing such a signal
during muscular contraction, clinical electromyographers are able to observe the recruitment
pattern of each motor unit that has been recruited. In the following, a brief history of EMG
decomposition is presented.

Bergmans [1] has described an on-line computer program that can identify individual MUAPs
automatically. The program stores the first potential it detects, which may be either a valid MUAP or
a superimposition, and compares it with the next six potentials. If it matches one of them, it is
assumed to be a valid MUAP; if it doesn't, it is assumed to be a superimposition. Up to five unique
MUAPs are collected per recording site, and histograms of their amplitudes, durations, and numbers
of phases (fluctuations that cross the baseline) are compiled. The program compares MUAPs
automatically on the basis of maximum sample-to-sample difference after alignment by threshold
crossing points.

Prochazka  et  al.  [2]  have  described  a  computer  method  for  clinical  firing  pattern  analysis.  The 
EMG  is  recorded  during  a  low  level  contraction  using  a  bipolar  wire  electrode  and  is  analyzed 
interactively  off-line.  The  operator  plays  back  the  EMG  at  slow  speed  and  selects  up  to  four 
distinct  MUAPs  as  templates.  The  program  then  attempts  to  identify  the  subsequent  potentials 
by  template  matching,  using  a  least-square-error  criterion,  and  is  able  to  resolve  superimpositions 
involving  two  MUAPs  automatically. 

LeFever, Mambrito and De Luca [3, 4] have developed a method for studying motor-unit firing
behavior during maximal contractions by decomposing the EMG signal recorded by a selective
multipolar needle electrode. The method is primarily intended for physiological investigations rather
than for clinical use, and is designed to perfectly identify every firing of up to eight simultaneously
active MUAPs. The method uses three data channels, obtained between different lead-off pairs
of the multipolar electrode, to increase MUAP distinguishability. It identifies the MUAPs using
a template matching procedure that takes into account both waveshape and relative firing likelihood
given the past firing history, and it can resolve superimpositions involving two MUAPs. It
can track slow changes in MUAP waveshape, firing rate, and firing-rate variability. Only records
derived from attempted isometric contractions have been decomposed. In order to achieve the desired
high level of identification performance, extensive interaction by a trained operator is needed to
create the templates, identify potentials about which the program is uncertain, and double-check
the program's identifications. The method has been able to reveal some interesting new phenomena
related to motor-unit behavior but is far too laborious for routine clinical use, requiring as much
as one hour of computation time per second of data.

Moschytz, De Figueiredo, and colleagues [5, 6] have presented an advanced method for
decomposing single-channel EMG recorded with standard concentric EMG needle electrodes. This
method can extract up to seven sets of simultaneously active MUAP waveforms and firing patterns
from partial IPs of moderate complexity. The analysis consists of three steps. The first is the
learning phase in which the number of different active MUAPs and their waveforms are determined
on the basis of shape-related features, independent of firing times. Next is a decomposition phase
which reanalyzes the signal in an attempt to detect all the firings of each identified MUAP. The
program attempts to resolve superimpositions using a sophisticated algorithm that is able to
optimally scale and align up to three of the MUAP templates to find the best match to each observed
superimposition potential.

McGill, Dorfman and Cummins [7, 8, 9] have described a different method of EMG analysis,
which they refer to as Automatic Decomposition EMG (ADEMG), which is specifically oriented
toward clinical application. ADEMG operates upon single-channel EMG signals of moderate
complexity - corresponding to contractile forces of up to 30% MVC - recorded using standard EMG
electrodes, and decomposes them into their constituent MUAPs. The analysis is performed in
nearly real time (less than 1 minute), a speed achieved by not attempting to resolve every
occurrence of a MUAP and by several computational innovations. Only steady isometric contractions
are decomposed, so that MUAP waveshapes and firing rates can be accurately estimated despite
incomplete identification. The method is now also available in hardware.

Figure 2: Block Diagram of the Proposed Inverse Neuromuscular System for Diagnosis and Control


4  Signal  Decomposition  Using  Hopfield  Network 

A block diagram of the proposed Hopfield network chip-based inverse neuro-muscular decomposition
system is shown in Figure 2. A muscle signal is simulated which is composed of five different patterns
of action potential trains. The shapes of the simulated action potentials are derived from typical
real muscle signals. Signal templates, one from each class of action potentials, are converted into their
bipolar bitmap patterns and are used to compute the network weights and biases. This multi-train
signal is then corrupted with 20% noise, as compared to the average action potential amplitude,
to represent a normal measurement environment. A moving window-based spike detector is designed
for the detection of the presence of action potentials as the window moves along the entire EMG
data. The spikes are identified whenever they cross a predetermined threshold. Once there is
a crossing of the threshold value, the detector backs up two data points and picks up a total
of 12 consecutive data points to represent the captured action potential. Each detected action
potential in the signal is then converted to its equivalent bitmap pattern. This bitmap pattern
is then presented to the Hopfield network for iterative stabilization to its closest attractor in the
network state space, which outputs the recognized signal pattern.
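
The detection step can be sketched as follows. The back-up of two data points and the window of 12 points come from the description above; treating the crossing as a positive-going comparison and skipping past each captured window before searching again are our assumptions, as are the function and parameter names.

```python
import numpy as np

def detect_spikes(signal, threshold, backup=2, length=12):
    """Return a list of (start_index, segment) for each captured action potential."""
    signal = np.asarray(signal, dtype=float)
    spikes = []
    i = 0
    while i < len(signal):
        if signal[i] > threshold:           # threshold crossing found
            start = max(i - backup, 0)      # back up two data points
            end = start + length            # take 12 consecutive points
            if end > len(signal):
                break                       # incomplete window at the end of the data
            spikes.append((start, signal[start:end].copy()))
            i = end                         # resume the search after the window
        else:
            i += 1
    return spikes
```

Each returned segment would then be passed on to the bitmap conversion stage.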

A Hopfield model is used for pattern storage [10]. The captured action potential of 12 data
points is converted into a 2-D bitmap (20x20) using bipolar binary notation. A matrix element
is valued "1" if a data value is present within the spatial resolution cell defined by the
bitmap matrix. Otherwise the matrix elements are valued "-1". The bitmap conversion is the
first step before the pattern is presented to the Hopfield network. Five different prototype
templates are thus converted to their bitmap patterns and are used to compute the weights of a
Hopfield auto-associative network.
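
One possible form of the bitmap conversion is sketched below. The text does not state how the 12 data points are mapped onto 20 columns or how amplitudes are quantized, so the linear interpolation to one value per column, the fixed amplitude range and all names are assumptions made for illustration only.

```python
import numpy as np

def to_bipolar_bitmap(samples, size=20, amp_range=(-1.0, 1.0)):
    """Convert a short waveform into a size x size bitmap of +1 / -1 entries."""
    samples = np.asarray(samples, dtype=float)
    # One waveform value per bitmap column (assumed: linear interpolation).
    cols = np.linspace(0, len(samples) - 1, size)
    values = np.interp(cols, np.arange(len(samples)), samples)
    lo, hi = amp_range
    rows = np.round((values - lo) / (hi - lo) * (size - 1)).astype(int)
    rows = np.clip(rows, 0, size - 1)       # amplitudes outside the range saturate
    bitmap = -np.ones((size, size), dtype=int)
    bitmap[size - 1 - rows, np.arange(size)] = 1   # row 0 is the top of the bitmap
    return bitmap
```

Flattening the result with `bitmap.ravel()` gives a 400-element bipolar vector suitable as a Hopfield network state.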

The noisy EMG signal is then presented to the spike detection system. Once the signal detector
detects a signal, the signal is converted to its bitmap pattern and is presented to the Hopfield
network for classification. After the network receives the corrupted pattern it goes through iterative
state transitions until it stabilizes into the closest attractor. After stabilization, the network stops
the iteration process and outputs the classification results. The decomposition system graphically plots
the corrupted multi-train EMG signal and also plots the classified action potentials in decomposed
channels, as shown in Figure 3.

Figure 3: The Corrupted EMG Signal and Its Decomposed MUAPTs
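
The storage and iterative stabilization steps can be sketched with a textbook Hopfield network. The paper does not give its weight formula, so the Hebbian outer-product rule, the zero biases and the asynchronous update order used here are assumptions; the function names are hypothetical.

```python
import numpy as np

def hopfield_weights(patterns):
    """Hebbian outer-product weights for bipolar (+1/-1) patterns, zero diagonal."""
    P = np.asarray(patterns, dtype=float)    # shape: (n_patterns, n_units)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, state, max_sweeps=50):
    """Asynchronously update units until no unit changes (an attractor is reached)."""
    s = np.array(state, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1.0 if W[i] @ s >= 0 else -1.0   # biases assumed to be zero
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s
```

A detected spike would be classified by comparing the stabilized state with the stored templates, for instance by choosing the template with the largest overlap.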

5  Results  and  Conclusions 

EMG signal decomposition is very important to clinical neurophysiologists for the diagnosis of
various neuro-muscular abnormalities. The firing patterns of individual motor units can also be
used for the study of motor control properties. The decomposition of the neuro-muscular signal using
a Hopfield network model is a successful experiment and can be easily implemented for real-time
application using a Hopfield neuro-chip. Recognition and decomposition of EMG patterns may also
serve an urgent need in deriving biological control algorithms which can be applied to control
prostheses.

From the simulated signal, with 20% added noise, the network has decomposed 30 spikes correctly,
into their respective motor channels, out of a total of 32 spikes in the corrupted signal, which
gives a success rate of nearly 94%. The success rates with 10% and 40% noise are found to be
100% and 50%, respectively. In this experiment, the prototype templates could also be adaptively
updated each time a new action potential is detected. This would allow the stored templates to adapt
to the gradual changes during the MUAP recruitment. However, in this experiment the templates
that are stored initially are not updated. It should also be noted here that the use of the bitmap
conversion technique eliminates the alignment problem that occurs in the time domain template matching
technique.

Abstract

The Radiation Signature (RS) paradigm takes aim at practical applications of the
knowledge provided by molecular studies of radiation-matter interactions in DNA. The
central proposition is the idea that the distribution of molecular lesions (i.e., a molecular
lesion spectrum, MLS) generated in DNA by exposure to a particular radiation is a
characteristic of that causal radiation (i.e., is a RS). We have found that adaptive neural
networks provide an efficient way to validate that proposition. Feature recognition
techniques become necessary when one deals with data bases that are less than optimal,
when one requires the minimum number of lesions adequate for signature, or when one
wishes to pursue the more tenuous and presumptive connection of a radiation and its
clinical outcome. The focus of this work is on the modeling and interpretation of RS's
using neural network processing. The specific goals of RS research in the domain of
feature extraction and modeling are: 1) to use trained neural networks to extract
signatures and markers from molecular lesion sets of various sizes, and to identify those
DNA lesions which serve as markers of specific radiation, 2) to investigate which
methods, including various neural networks and various architectures, can suggest the
molecular lesion sets that are most appropriate for particular purposes, and 3) to use
adaptive neural networks for interpretation of results. Although efforts to identify
products of radiation that are specific to radiation type and to link these with biological
responses are almost a century old, the RS concept has provided the first quantitative
confirmation of such causal relations.

1)  K.  Rupnik,  and  S.  P.  McGlynn  "Molecular  Lesion  Spectra  as  Radiation  Signatures" 
Spectroscopy  Letters,  26  5  873  (1993). 

2) S. P. McGlynn, K. Rupnik, M. N. Varma and L. Klasinc, "Radiation Signatures and
Radiation Markers", Radiation Protection Dosimetry (in press, 1994).

Ability of the 3D Vector Version of the

Back-Propagation  to  Learn  3D  Motion 

Tohru  Nitta 

Electrotechnical  Laboratory, 

1-1-4  Umezono,  Tsukuba  Science  City,  Ibaraki,  305  Japan. 


Abstract  : 

The 3D vector version of the back-propagation algorithm (called “3DV-BP”) is a natural
extension of the complex-valued version of the back-propagation algorithm (called
“Complex-BP”). The Complex-BP can be applied to multi-layered neural networks whose
weights, threshold values, input and output signals are all complex numbers, and the 3DV-BP
can be applied to multi-layered neural networks whose threshold values, input and
output signals are all 3D real valued vectors, and whose weights are all 3D orthogonal
matrices. It has already been reported that an inherent property of the Complex-BP is
its ability to learn “2D motion”. This paper shows in computational experiments that
the 3DV-BP has the ability to learn “3D motion”, which corresponds to the ability of the
Complex-BP to learn “2D motion”.

1  Introduction 

One of the most popular neural network models is the multi-layered network and the related
back-propagation training algorithm, called here “Real-BP” [7]. Back-propagation
networks have many successful applications.

The  “Complex-BP”  is  a  complex  valued  version  of  the  back-propagation  algorithm, 
which  can  be  applied  to  multi-layered  neural  networks  whose  weights,  threshold  values, 
input  and  output  signals  are  all  complex  numbers  [1,  3].  This  algorithm  enables  the 
network  to  learn  complex  valued  patterns  naturally.  It  has  already  been  reported  that  an 
inherent  property  of  the  Complex-BP  is  its  ability  to  learn  “2D  motion”  [1,  3].  And  also, 
the  Complex-BP  has  been  applied  to  the  interpretation  of  optical  flow  (motion  vector  field 
calculated  from  images)  and  estimation  of  motion  which  are  important  tasks  in  computer 
vision  [5,  6]. 

The “3DV-BP” is a three-dimensional vector version of the back-propagation algorithm
which can be applied to multi-layered neural networks whose threshold values,
input and output signals are all 3D real valued vectors, and whose weights are all 3D orthogonal
matrices [2]. This algorithm is a natural extension of the Complex-BP algorithm.
This paper shows in computational experiments that the 3DV-BP has the ability to learn
“3D motion”, which corresponds to the ability of the Complex-BP to learn “2D motion”.

Hereafter,  we  shall  refer  to  a  real  valued  (usual)  neural  network  used  by  the  Real-BP 
as  a  “Real-BP  network”,  a  complex  valued  neural  network  used  by  the  Complex-BP  as 
a “Complex-BP network”, and a three-dimensional vector valued neural network used by
the  3DV-BP  as  a  “3DV-BP  network”. 

This  paper  is  organized  as  follows:  Section  2  describes  the  3DV-BP  algorithm,  and 
Section  3  deals  with  the  empirical  analyses  of  the  ability  of  the  3DV-BP  network  model 
to learn 3D motion. The paper will end with our conclusions.

2  The  “3DV-BP”  Algorithm 

This section briefly describes the 3DV-BP algorithm [2]. It can be applied to multi-layered
neural networks in which threshold values, input and output signals are all 3D real valued
vectors, and whose weights are all 3D orthogonal matrices, and the output function F of
a neuron can be defined as

$$F(A) = \begin{bmatrix} f(a_1) \\ f(a_2) \\ f(a_3) \end{bmatrix}, \qquad A = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \qquad (1)$$

where $f(u) = 1/(1 + \exp[-u])$, that is, each component of an output $F(A)$ of a neuron
is the sigmoid function of the corresponding component $a_m$ of the net input $A$ to the neuron
($m = 1, 2, 3$). The learning rule has been obtained by using a steepest descent
method.
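
The output function can be sketched as below. The component-wise sigmoid follows the definition above; the form of the net input used in `neuron_forward` (a sum of orthogonal-matrix weights applied to the 3D inputs, plus a 3D threshold) is our assumption, since the net input is not written out in this section, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def output_function(A):
    """F(A): apply the sigmoid to each component a_m of the net input A."""
    return sigmoid(np.asarray(A, dtype=float))

def neuron_forward(weights, inputs, threshold):
    """Output of one 3D vector neuron.

    Assumed net input: A = sum_k W_k x_k + theta, where each W_k is a 3x3
    orthogonal matrix and each x_k and theta are 3D real vectors.
    """
    A = sum(W @ x for W, x in zip(weights, inputs)) + np.asarray(threshold, dtype=float)
    return output_function(A)
```

For instance, a single weight equal to a rotation by 90 degrees about the third axis maps the input (1, 0, 0) to the net input (0, 1, 0) before the sigmoid is applied.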

3  Ability  to  Learn  3D  Motion 

We will now present some illustrative examples to show that an adaptive network of 3D
valued neurons can be used to learn 3D motion such as rotation, similar transformation,
and translation. Due to space limitations, we will restrict the presentation of our results
to similar transformations, although similar work has been carried out on rotations and
parallel displacement [4].

We used a 1-6-1 three-layered network, which transformed a point $(x_1, x_2, x_3)$ into another
point $(x'_1, x'_2, x'_3)$ in 3-dimensional space. Although the 3DV-BP network generates
a value $x = {}^t[x_1\ x_2\ x_3]$ within the range $0 < x_1, x_2, x_3 < 1$, for the sake of convenience,
we present it in the figures given below as having a transformed value within the range
$-1 < x_1, x_2, x_3 < 1$.

We  also  conducted  experiments  with  a  3-15-3  network  with  real  valued  weights  and 
thresholds, to compare the 3DV-BP with the Real-BP. The first component of a 3-vector
was  input  into  the  first  input  neuron,  the  second  component  was  input  into  the  second 
input  neuron,  and  the  third  component  was  input  into  the  third  input  neuron.  The  output 
from  the  first  output  neuron  was  interpreted  as  the  first  component  of  a  3-vector,  and  the 
output  from  the  second  output  neuron  was  interpreted  as  the  second  component,  and  the 
output  from  the  third  output  neuron  was  interpreted  as  the  third  component. 

The learning constant ε used in these experiments was 0.5. The initial components
of the weights and the thresholds were chosen to be random real numbers between -0.3
and 0.3. We determined that learning finished when


=  0.05  (2) 

held, where $\|x\| = \sqrt{x_1^2 + x_2^2 + x_3^2}$ for $x = {}^t[x_1\ x_2\ x_3]$. The two vectors in $R^3$
compared on the left side of (2) are the desired output value and the actual output value
of the neuron k for the pattern p, i.e. the left side of (2) means the error between the
desired and actual output patterns (R denotes the set of real numbers); N denotes the
number of neurons in the output layer. We regarded presenting a set of learning patterns
to the neural network as one learning cycle. The time complexity per learning cycle of the
1-6-1 three-layered network for the 3DV-BP was nearly equal to that of the 3-15-3
three-layered network for the standard BP, as seen in Table 1 (396 versus 405 operations).
Furthermore, the space complexity (i.e. the number of parameters) was almost half that of
the standard BP (57 versus 108 parameters).
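
The parameter counts in Table 1 can be reproduced with a short calculation. The sketch
below assumes fully connected layers and, as an inference from the table rather than a
statement in the text, that each 3D orthogonal matrix weight contributes three free
parameters and each 3D threshold contributes three. The function names are illustrative.

```python
def layer_pairs(sizes):
    """Number of connections between consecutive fully connected layers."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

def parameter_count(sizes, per_weight, per_threshold):
    """Return (weights, thresholds, total) counted in real-valued parameters.

    Thresholds are counted for every non-input neuron.
    """
    weights = layer_pairs(sizes) * per_weight
    thresholds = sum(sizes[1:]) * per_threshold
    return weights, thresholds, weights + thresholds

# Assumption: 3 real parameters per 3DV-BP weight and per 3D threshold.
three_d = parameter_count([1, 6, 1], per_weight=3, per_threshold=3)
# Standard BP: one real parameter per weight and per threshold.
real_bp = parameter_count([3, 15, 3], per_weight=1, per_threshold=1)
print(three_d, real_bp)
```

Under these assumptions the counts agree with the weights, thresholds, and sums listed in
Table 1.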

The experiments described in this section consisted of two parts: a training step,
followed by a test step.

3.1  Learning of a Simple 3D Motion

Figs.  1  and  2  show  the  result  of  an  experiment  on  a  simple  similar  transformation. 

The training step consisted of learning a set of (3D orthogonal matrix valued) weights
and (3D vector valued) thresholds, such that the input set of 11 points (with equal
intervals) lying along the straight line

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = t \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \quad (0.0 \le t \le 1.0)$$  (3)

gave as output, half-scaled straight line points (Fig. 1). The output training points also
lay along the same straight line (equation (3)), but the range was $0.0 \le t \le 0.5$. To avoid
complexity, we omitted the points and showed only the lines joining them in the figures.
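
A minimal sketch of this training set follows. It assumes that the 11 equally spaced
points include both endpoints of the range of t (the text does not say so explicitly) and
that each target is the input scaled by one half. The name `line_training_pairs` is
illustrative.

```python
def line_training_pairs(n_points=11, t_min=0.0, t_max=1.0, scale=0.5):
    """Return (input, target) pairs along the line (x, y, z) = t * (1, 1, 1)."""
    pairs = []
    for i in range(n_points):
        t = t_min + (t_max - t_min) * i / (n_points - 1)
        point = (t, t, t)
        target = tuple(scale * c for c in point)
        pairs.append((point, target))
    return pairs

pairs = line_training_pairs()
```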

In a second (test) step, 48 input points (with equal intervals) lying on three squares
were presented, in the hope that they would be mapped to an output set of points lying on
three half-scaled squares. The actual output test points for the 3DV-BP did, indeed, almost
lie on those squares, with some error (Fig. 2).

To compare how a real valued network would perform, the 3-15-3 (real valued) network
mentioned above was trained using the same pairs of training points lying along equation
(3). The same 48 test points lying on the three squares were then input to this real
valued network. All points were “mapped” onto straight lines, as shown in Fig. 2.


3.2  Learning  of  a  More  Complex  3D  Motion 

This subsection shows that the 3DV-BP can learn a more complicated transformation.

Fig. 3 shows how the training points mapped onto each other. The 11 points (with
equal intervals) lying along the straight line indicated by “Input Pattern 1” mapped onto
points along the same line, but with a scale reduction factor of 2. The 11 points (with
equal intervals) lying along the straight line indicated by “Input Pattern 2” mapped onto
points along the same line, but with a scale reduction factor of 10. All the training points
lie along the straight line

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = t \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \quad (-1.0 \le t \le 1.0),$$  (4)

where “Input Pattern 1” corresponds to $0.0 \le t \le 1.0$, “Output Pattern 1” to $0.0 \le t \le 0.5$,
“Input Pattern 2” to $-1.0 \le t \le 0.0$, and “Output Pattern 2” to $-0.1 \le t \le 0.0$.

In the test step, by presenting the 60 points lying on the three circles $x^2 + z^2 = 1$,
$y^2 + z^2 = 1$ and $x^2 + y^2 = 1$, the actual output points took the patterns as shown in Fig. 4.

It appears that this 3DV-BP network has learned to generalize the reduction factor
α as a function of the position in three-dimensional space, i.e. a point ${}^t[x\ y\ z]$ is
transformed into another point $\alpha \cdot {}^t[x\ y\ z]$, where $\alpha({}^t[x\ y\ z]) \approx 0.5$ for $x, y, z > 0$, and
$\alpha({}^t[x\ y\ z]) \approx 0.1$ for $x, y, z < 0$.


4  Conclusions 

We investigated by computational experiments the characteristics of the 3DV-BP algorithm,
which is a natural extension of the Complex-BP algorithm. We found that
the 3DV-BP has the ability to learn “3D motion” such as similar transformation as its
inherent property, which corresponds to the ability of the Complex-BP to learn “2D
motion”. We expect that applications for the 3DV-BP algorithm will be found in such
areas as 3D image processing.

Network               Time complexity               Space complexity
                      × and ÷   + and −   Sum       Weights   Thresholds   Sum
3DV-BP 1-6-1            255       141     396          36         21        57
Standard BP 3-15-3      264       141     405          90         18       108

Table 1  The Computational Complexity of the 3DV-BP and the Standard BP. Time complexity
means the sum of the four operations performed per learning cycle. Space complexity means
the sum of the parameters (weights and thresholds).

Abstract

Simulation studies confirm the previously observed ability of recurrent neural networks to emulate finite
automata in a stable manner while processing arbitrarily long input sequences. First order recurrent networks with
two different architectures have been trained to emulate given automata, and untrained networks with randomly
assigned weights can often spontaneously emulate automata. This behavior is tied to the existence of regions of the
network state space which correspond to the states of the automata and are mapped into each other by the inputs to
the network. Networks initialized at arbitrary points in the state space and supplied with random input sequences
enter these regions in a small number of time steps and then remain in them indefinitely. This automaton structure
can be thought of as a generalization of the limit cycle attractor for systems with varying inputs.

Background

Researchers applying recurrent neural networks to the problem of learning regular grammars have
consistently found that they accomplish this task by configuring themselves to approximately emulate the finite
automata associated with these grammars (Cleeremans, Servan-Schreiber, & McClelland, 1989; Giles, Miller,
Chen, Chen, Sun, & Lee, 1992; Omlin, Giles, & Miller, 1992 and others). The points in the network state space
visited while processing strings of characters tend to cluster, with the clusters corresponding to the states of the
automata. Since there are errors in the output of a recurrent neural network at each time step, it has been generally
assumed that these errors should accumulate with time, producing a progressive degradation in performance. Such
degradation has been observed when networks have been tested on strings much longer than those on which they
were trained. The observed clusters grow, become less well separated and eventually overlap and lose their
identity. An exception to this was observed by Manolios & Fanelli (1993). Some recurrent networks trained on
relatively short strings maintained their performance and retained well defined clusters that did not grow even
when processing strings up to 1,000 characters in length. Zeng, Goodman, & Smyth (1993) seem to have observed
similar behavior in one of their networks. Manolios & Fanelli (1993) concluded that the finite automaton like
behavior of the network had been somehow encoded in its attractor structure (when considered as a dynamical
system) so that it could be retained indefinitely, without degradation. This paper reports on further investigations of
this phenomenon and its nature as the generalization of a limit cycle attractor. A simulation study such as this is
empirical and thus cannot be definitive or all encompassing. However, it can expose some of the features of the
phenomenon and thus provide a basis for theoretical analyses.

Introduction

The study has investigated discrete-time, first order, recurrent neural networks with two architectures, the
simple recurrent network of Elman (1990) and an architecture with two recurrent layers, feeding back to each other
(Manolios & Fanelli, 1993). The inputs to the networks were restricted to be members of finite sets (e.g.
representations of the characters in a character set), but could otherwise vary arbitrarily with time. Systems with no
inputs or constant ones normally have well defined equilibrium points, limit cycles or chaotic attractors. Hirsch
(year illegible in this copy) has briefly discussed systems with variable inputs, suggesting that in some cases such a
system can act like "a rather unreliable finite state automaton." Each different input can be thought of as defining a
different mapping of the system state from one time step to the next, thus defining a different dynamics for the
system. Repetition of each input thus results in an approach to a distinct set of attractors. Any finite sequence of
inputs also specifies a dynamics, but with a larger effective time step. Its own attractors will be approached after
enough repetitions. There is thus an infinite number of sets of attractors associated with a network and its inputs.
In this study, a few of these attractors have been observed and points on them found to be closely associated with
stable clusters of network states corresponding to automaton states.
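
The idea that each input defines its own state map can be illustrated with a small sketch.
It assumes a generic first order recurrent update, state' = sigmoid(W state + u bit + b),
with two state units and weights drawn uniformly from [-2, 2]; the weights, seed, and
function names are hypothetical and are not the networks studied here.

```python
import math
import random

def make_step(n_state=2, weight_range=2.0, seed=0):
    """Build the one-step state map of a small first order recurrent network."""
    rng = random.Random(seed)
    draw = lambda: rng.uniform(-weight_range, weight_range)
    w_state = [[draw() for _ in range(n_state)] for _ in range(n_state)]
    w_input = [draw() for _ in range(n_state)]
    bias = [draw() for _ in range(n_state)]

    def step(state, bit):
        new_state = []
        for i in range(n_state):
            net = bias[i] + w_input[i] * bit
            net += sum(w_state[i][j] * state[j] for j in range(n_state))
            new_state.append(1.0 / (1.0 + math.exp(-net)))
        return tuple(new_state)

    return step

def drive(step, state, bits):
    """Apply the input sequence and return every state visited."""
    trajectory = []
    for bit in bits:
        state = step(state, bit)
        trajectory.append(state)
    return trajectory

step = make_step()
# Repeating one input applies the same map over and over.
tail_of_zeros = drive(step, (0.5, 0.5), [0] * 100)[-4:]
tail_of_ones = drive(step, (0.5, 0.5), [1] * 100)[-4:]
```

Inspecting the tails of such trajectories is one way to see whether a repeated input has
settled onto an equilibrium point or a short cycle; what is found depends on the weights.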

Methods used

Automaton-like behavior can be identified in the following way. As a network processes a sequence of inputs,
clusters of network states corresponding to automaton states can remain stable indefinitely, provided that the points
in these clusters lie in regions with the following property: in one time step, a point in such a region is mapped by
any member of the input set into a point that lies within the same or another such region. A subtractive algorithm
for identifying stable regions, based on this property, has been developed. First, the clusters of states observed
when a network processes long random strings are used as a guide for choosing initial regions which include them.
The remainder of the state space is denoted the excluded region. Then a fixed rectangular grid is imposed on the
state space. Each included point on a grid intersection is mapped by each member of the input set to a resulting
point. If any of the resulting points lie in the excluded region, the grid point is reassigned to the excluded region.
This continues until no more grid points can be excluded. The boundaries of the resulting regions are uncertain
because of the finite grid size. The algorithm deals conservatively with this uncertainty by excluding a grid point
unless all of its mapped points are entirely surrounded by included grid points. Finally, points whose mappings lie
in the stable regions but which are not themselves close to any mapped points are considered not to be part of the
asymptotic regions and are removed. The remaining points indicate the stable regions. The results of the algorithm
may depend on the initial choice of the regions, so the algorithm can be repeated with different choices to check
consistency.
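
The subtractive procedure can be sketched as follows. This is a simplified illustration
under stated assumptions, not the implementation used in the study: the state space is
taken to be the unit square, the final pruning of points that are not close to any mapped
point is omitted, and the two toy maps stand in for a network driven by inputs 0 and 1.
All names are hypothetical.

```python
import math

def stable_grid_points(maps, initial, n):
    """Subtractive search for grid points whose images stay inside the included set.

    maps    : list of functions (x, y) -> (x, y), one per input symbol
    initial : function (x, y) -> bool, the initial guess for the included regions
    n       : number of grid intervals per axis on the unit square
    """
    h = 1.0 / n
    included = {(i, j) for i in range(n + 1) for j in range(n + 1)
                if initial(i * h, j * h)}

    def surrounded(x, y):
        # Conservative test: all four corners of the grid cell holding (x, y)
        # must currently be included.
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            return False
        i = min(int(math.floor(x / h)), n - 1)
        j = min(int(math.floor(y / h)), n - 1)
        return all((i + di, j + dj) in included for di in (0, 1) for dj in (0, 1))

    changed = True
    while changed:  # repeat until no more grid points can be excluded
        changed = False
        for (i, j) in sorted(included):
            if not all(surrounded(*f(i * h, j * h)) for f in maps):
                included.discard((i, j))
                changed = True
    return included

# Hypothetical toy maps: each input contracts the state toward its own center.
def toward(cx, cy):
    return lambda x, y: (cx + 0.5 * (x - cx), cy + 0.5 * (y - cy))

maps = [toward(0.25, 0.25), toward(0.75, 0.75)]
box = lambda x, y: 0.19 <= x <= 0.81 and 0.19 <= y <= 0.81
corner = lambda x, y: x <= 0.06 and y <= 0.06
stable = stable_grid_points(maps, lambda x, y: box(x, y) or corner(x, y), n=20)
```

In this toy case the central box is mapped into itself by both maps and survives, while
the isolated corner patch is mapped outside every included cell and is removed.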

In this paper, sample results for three neural networks are shown. In each case, the network has a single input
unit, set to zero or one, so that over time the inputs are bit strings. The number of units representing the states of the
networks was held to two, so that the state spaces were two dimensional and could be easily displayed. The first
network had an Elman architecture, with two hidden units representing the state. The second had the 2-layer
architecture with two state units, one designated for output. These networks were trained with versions of the RTRL
algorithm (Williams & Zipser, 1989) on simple bit string grammars. The third was another 2-layer network,
randomly initialized and untrained. Each network was studied in the same way. All the networks were tested with a
randomly selected state at the start of the processing of each string. First, the network was tested with 3 random
strings of 1000 0's and 1's and the last 100 states in each case were observed. The clustering of these states served
as a guide for the algorithm for finding stable regions. Second, this algorithm was applied. Third, the networks
were tested with 9 strings of 100 0's, then 9 strings of 100 1's and then 9 strings of 50 01 pairs. These tests located
the attractors for single characters and two character sequences, which appeared well in advance of 100 time steps.
Fourth, convergence of the networks was tested with 100 strings of 10 random characters, which turned out to be
long enough to reach the asymptotic automaton region. With random initial states, this test indicated whether the
asymptotic automaton was reached globally or from some basin of attraction.

Elman Architecture

The network was trained as in Fanelli (1993) on a simple bit string grammar which was found easy to learn
for this architecture. The grammatical strings are just those that end in the sequence "01". The training set consisted
of all strings less than or equal to 4 characters in length, including the null string. Forty nets were initialized
randomly and trained for 10,000 epochs each with a learning parameter of 0.01 and a momentum term of 0. The
initial states for the nets were chosen randomly during the initialization, then held fixed during training. The
network with the smallest root-mean-square (rms) error (0.016) was further trained to a total of 350,000 epochs.
Its final rms error was 0.0011. This network was then studied as described above.
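
The training set for this grammar is small enough to enumerate. The sketch below lists
every bit string of length 0 to 4 and labels it by whether it ends in "01"; the function
names are illustrative.

```python
from itertools import product

def all_bit_strings(max_len):
    """All strings over {0, 1} of length 0..max_len, including the null string."""
    strings = []
    for n in range(max_len + 1):
        strings.extend("".join(bits) for bits in product("01", repeat=n))
    return strings

def is_grammatical(s):
    """A string is grammatical when it ends in the sequence "01"."""
    return s.endswith("01")

training_set = [(s, is_grammatical(s)) for s in all_bit_strings(4)]
positives = [s for s, ok in training_set if ok]
```

This gives 31 strings in total, of which 7 are grammatical.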

Figure 1 shows the stable regions found for this network, which correspond to the three states of the minimal
automaton for the grammar. The rectangular shape of the regions is purely a consequence of the coarseness of the
grid used in this case. The attractors for single characters and 2 character sequences are also shown and lie within
the stable regions. The long random string tests described above yielded clusters of states that all lay within the
stable regions also. These are not shown. In the convergence test, the network states reached stable regions within
10 time steps from all 100 initial states, indicating global convergence.

Two-Layer Architecture, Tomita 6 grammar

The network was trained on the sixth grammar of Tomita (1982). The training set used initially consisted of
all strings less than or equal to 4 characters in length, including the null string. The initial states for the nets were
chosen as described above for the Elman nets. To begin, 182 nets were initialized randomly and trained for 1000
epochs each, with a learning parameter of 0.1 and a momentum term of 0.5. Of these nets, 5 had root-mean-square
errors less than 0.1. The best, with an rms error of 0.042, was chosen for further training. It was trained for another
1000 epochs on the same training set, reaching an rms error of 0.027. It was then tested on each of 3 single strings
consisting of 1000 random 0's and 1's (different from the strings described above). Its largest rms error was 0.031.
The network was trained for another 3000 epochs on the string producing this error, reaching a final rms error of
0.010.

Figure 2 shows the stable regions and 1 and 2 character attractors for this network. There are four stable
regions corresponding to an automaton that is reducible to the three state minimal automaton for Tomita 6. The
attractors are all limit cycles with two 3 state cycles for single characters and three 2 state cycles for two character
sequences, all expected from the grammar itself. The stable regions are larger than in the previous case, with the
lower two breaking up into sub-regions. As above, the last 100 states of long strings all lay well within the regions
shown. Global convergence to the stable regions within 10 time steps was observed for this network also.
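
The cycle structure expected from the grammar can be checked on the minimal automaton.
The sketch below assumes the usual statement of Tomita's sixth grammar, which this paper
does not spell out: a string is accepted when the number of 0's minus the number of 1's is
a multiple of 3. The state encoding and function names are illustrative.

```python
def step(state, char):
    """Minimal 3-state automaton: the state is (#0's - #1's) mod 3."""
    return (state + 1) % 3 if char == "0" else (state - 1) % 3

def run(state, string):
    for char in string:
        state = step(state, char)
    return state

def accepts(string):
    """The start state 0 is also the single accepting state."""
    return run(0, string) == 0

def states_visited(start, pattern, repeats=6):
    """States seen after each character while `pattern` is repeated."""
    seen, state = [], start
    for char in pattern * repeats:
        state = step(state, char)
        seen.append(state)
    return seen

# One cycle per repeated single character, one per start state for "01" pairs.
single_char_cycles = {c: set(states_visited(0, c)) for c in "01"}
pair_cycles = {frozenset(states_visited(s, "01")) for s in range(3)}
```

Under this assumption, repeating a single character visits all three states in turn, and
repeating the pair "01" alternates between two states, with three such pairs possible.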

Two-Layer Architecture, untrained

Networks were randomly initialized in the same manner as the other nets, but were not trained. The weight
values were uniformly distributed between -2.0 and +2.0. The same studies were performed as on the other nets.

Figure 3 shows results for one such network. They are similar to those already presented, even though the
networks were not trained. Once again, later states of long random strings clustered within the stable regions and
global convergence within 10 time steps was observed.

Discussion

The above results indicate that stable automaton-like behavior is often a feature of recurrent neural networks
running for large numbers of time steps. Untrained nets can display it spontaneously and nets can be trained to
approximate particular automata with errors remaining within fixed tolerances. It is likely that this behavior is a
feature of other systems also, such as second order neural networks and other kinds of discrete-time dynamical
systems. It may be considered as a natural generalization of the limit cycle to systems with discrete time-varying
inputs.

The automata observed appear to have a finite number of states at the accuracy scale of this study. However,
the sub-regions observed for the second network and other observations suggest that as the grid size (or other scale
measure) becomes finer, breakup into still smaller regions may be observed. This raises the possibility that in the
limit these regions approach a possibly unbounded number of distinct points. Definitive resolution of this question
goes beyond simulation studies and requires a theoretical analysis.

Figure 1. Stable regions and three attractors for an Elman network trained on a simple
bit string grammar (axes: State Unit 1 and State Unit 2). Data point symbols are scaled to
approximately indicate the grid size used in applying the stable region finding algorithm.
The single character attractors are both equilibrium points while the two character
attractor is a two state limit cycle. The dotted line traces the cycle and the repeated
character sequences associated with the attractors are indicated with arrows. The region at
the lower right corresponds to the start state and that at the lower left to the single
final state.

Figure 2. Stable regions and simple attractors for a 2-layer network trained on the Tomita
6 bit string grammar. Data point symbols are scaled to approximately indicate the grid size
used in applying the stable region finding algorithm. The two single character attractors are
three state limit cycles while the three two character attractors are two state limit cycles. The
dotted lines trace the cycles and the repeated character sequences associated with them are
indicated with arrows. The "0" cycle is traversed counterclockwise and the "1" cycle
clockwise. The automaton is reducible, with the two upper states reducing to the start state,
which is also the single final state.

Figure 3. Stable regions and simple attractors for a 2-layer network randomly initialized,
but untrained. Data point symbols are scaled to approximately indicate the grid size used in
applying the stable region finding algorithm. The two single character attractors are
equilibrium points while the two character attractor is a two state limit cycle. The dotted lines
trace the cycles and the repeated character sequences associated with them are indicated
with arrows.

ABSTRACT

A novel artificial neural network architecture is introduced. The network employs the Hausdorff distance as a
metric of similarity between two-dimensional patterns. The Hausdorff distance exhibits many properties that are
desirable in a pattern recognition context, and the network inherits these advantageous traits. The architecture,
learning rule, and recall processing for the network are presented. It is shown that the network internally employs
a version of the Voronoi surface to facilitate processing. Preliminary results from applying the network to pattern
recognition tasks are encouraging.

INTRODUCTION 

This paper introduces a new neural network architecture specifically designed for two-dimensional pattern
recognition. Because the network utilizes the Hausdorff distance as a metric of similarity between patterns, and
because it employs a learned version of the Voronoi surface to perform the comparison, it has been dubbed the
HAusdorff-Voronoi NETwork or HAVNET. The choice of the Hausdorff distance as the metric of similarity
between input patterns and learned patterns is what makes HAVNET different from most other neural network
paradigms.

Some current neural networks treat the input as a vector and the trained weights as another vector, and the measure
of similarity becomes the difference between these two vectors [e.g. 1]. The node with an internal weight vector
that most closely matches the presented vector generates the highest output response. Unfortunately, transforming
a two-dimensional input pattern into a multidimensional vector can produce behavior that is counter-intuitive. Input
patterns that appear very similar (to the human eye) to learned patterns for a particular node can generate very poor
responses from that node.

Other networks treat two-dimensional input patterns as a matrix, and the measure of comparison between the input
matrix and the stored (learned) matrix is computed as a correlation between the two patterns [e.g. 2]. These
networks essentially perform template matching. Although these networks perform well when input patterns are
identical to learned patterns, slight distortions in the input patterns can drastically reduce the output from the trained
node, again producing results that do not agree well with human interpretations of pattern similarity.

One measure of similarity between two-dimensional binary patterns that has been shown to agree closely with human
performance is the Hausdorff distance [3]. The Hausdorff distance measures the extent to which each point of an
input set lies near some point of a model set. Given two finite point sets $A = \{a_1, \ldots, a_p\}$ and
$B = \{b_1, \ldots, b_q\}$, the Hausdorff distance is defined as:

$$H(A, B) = \max\{h(A, B),\, h(B, A)\}$$  (1)

where the function h(A,B) computes the directed Hausdorff distance from A to B as follows:

$$h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|$$  (2)

where $\|a - b\|$ is typically the Euclidean distance between points a and b. The directed Hausdorff distance
identifies that point in A that is furthest from any point in B and measures the distance from that point to its nearest
neighbor in B. If h(A,B) = d, all points in A are within distance d of some point in B. The (undirected) Hausdorff
distance, then, is the maximum of the two directed distances between two point sets A and B, so that if the Hausdorff
distance is d, then all points of set A are within distance d of some point in set B and vice versa.
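
Equations (1) and (2) translate directly into a brute-force computation. The sketch below
assumes Euclidean distance and finite, non-empty point sets given as lists of coordinate
tuples; the function names are illustrative, and this is not the network's own
computation, which is described in later sections.

```python
import math

def directed_hausdorff(a_points, b_points):
    """h(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in b_points) for a in a_points)

def hausdorff(a_points, b_points):
    """H(A, B) = max{h(A, B), h(B, A)}."""
    return max(directed_hausdorff(a_points, b_points),
               directed_hausdorff(b_points, a_points))

A = [(0.0, 0.0), (3.0, 4.0)]
B = [(0.0, 0.0)]
h_ab = directed_hausdorff(A, B)  # driven by the point (3, 4), which is far from B
h_ba = directed_hausdorff(B, A)  # the only point of B coincides with a point of A
```

The example shows why the directed distance is not symmetric and why the undirected
distance takes the maximum of the two directions.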

The Hausdorff distance exhibits many desirable properties for pattern recognition. First, it is known to be a metric
over the set of all closed, bounded sets [4]. Also, it is everywhere non-negative and it obeys the properties of
identity, symmetry, and triangle inequality. In the context of pattern recognition this means that a shape is identical
only to itself, that the order of comparison of two shapes does not matter, and that if two shapes are highly
dissimilar they cannot both be similar to some third shape. This final property (triangle inequality) is particularly
important for reliable pattern classification. It was because of these advantageous properties that the Hausdorff
distance was chosen as the similarity metric that is the basis of HAVNET. The architecture, learning rule, and
recognition process used in HAVNET are described in detail in the following sections.

NEURAL NETWORK ARCHITECTURE

An overview of the architecture of HAVNET is shown in Figure 1. The neural network behaves as a binary pattern
classifier. It takes as inputs two-dimensional binary patterns, it employs feed-forward processing, and it produces
analog output patterns. One output is generated by each node, with the analog value indicating the level of match
between the input pattern and the class represented by that node. The neural network consists of three layers: the
plastic layer, the Voronoi layer, and the Hausdorff layer. The plastic layer contains neurons with the weights that are
trained during the learning process, the Voronoi layer serves to measure the distance between individual points in
the input and learned patterns, and the Hausdorff layer uses information from the Voronoi layer to compute the overall
level of similarity between the input pattern and the learned pattern. A detailed diagram of the architecture for a single
node is shown in Figure 2. The node is shown in a configuration for one-dimensional inputs for reasons of
clarity. In the actual network, the input pattern, plastic layer, and Voronoi layer are all two-dimensional.

Figure 1. Neural Network Architecture Overview

Learning is employed in HAVNET to adapt the individual nodes to recognize certain classes, and it is conducted by
presenting examples of each class to the network during a training phase. Learning in the particular implementation
of the network described here is done off-line and in a supervised manner, but it could alternatively employ on-line
learning and self-organization. The specific details of how learning and recognition are carried out are presented in
the following sections.

NEURAL  NETWORK  LEARNING 

Off-line supervised learning is implemented in the version of HAVNET described here. The network is trained to
recognize objects off-line, and the network is informed a priori of the class to which each training pattern belongs.
Once the network is put into the recognition (on-line) mode, training ceases and recognition response is repeatable
and predictable.

The weight matrices that undergo changes during the learning process reside in the plastic layer of the network (the
reader may wish to refer to Figures 1 and 2 throughout this discussion). A binary input pattern $A^m$ that is presented
to the network during the learning phase is represented as follows:

$$A^m = \{a^m_{x,y}\}, \quad a^m_{x,y} \in \{1, 0\}, \quad x = 1 \ldots X, \; y = 1 \ldots Y$$  (3)

Where: X, Y are the dimensions of the input pattern

Prior to learning, each weight matrix W of the network is initialized as follows:

$$w^n_{(x+\delta),(y+\delta)} = 0, \quad n = 1 \ldots N$$  (4)

$$w^n_0 = 0$$  (5)

Where: N is the total number of nodes in the network

The quantity δ is defined as the span of the Voronoi layer, and its meaning will be elaborated upon later. At this
point it is necessary only to state that it is a positive integer value that is much less than the dimensions of the input
pattern. The weight $w^n_0$ is defined as the averaging weight for a node, and it is trained during each training pass
regardless of the input pattern. The $w_0$ weights serve as an indicator of the extent to which each node has received
training.

When node n is trained on input pattern m, the change in each of the weights is computed as follows:

$$\Delta w^n_{(x+\delta),(y+\delta)} = a^m_{x,y}\, \alpha \left(1 - w^n_{(x+\delta),(y+\delta)}\right)$$  (6)

$$\Delta w^n_0 = \alpha \left(1 - w^n_0\right)$$  (7)

Where: 0 ≤ α ≤ 1 is the learning rate
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Once  the  weight  change  is  conqmted,  the  weights  are  updated  as  follows: 

*^x*V)V(y+«)  “  *^x**),  (8) 

^U*X)  ^  (9, 

Where:  t  »  training  iterations 

During the learning phase, each training pattern is presented to the network in sequence, and the appropriate node
is trained using the equations above. The learning rate determines the magnitude of the effect that each training
exemplar has on the trained weights. The saturation-like behavior of the learning rule (see Equation 6) guarantees
that the learning process will reliably converge for any finite set of training patterns.
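
The learning procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight change is taken to have the saturating form "learning rate times input times (1 - weight)", and the plastic-layer weights are assumed to be stored in an array padded by the span on every side, so that input element (x, y) maps to weight (x + span, y + span). The function names `init_node` and `train_node` are hypothetical and do not come from the paper.

```python
import numpy as np

def init_node(X, Y, delta):
    """Return zero-initialized plastic weights (padded by delta) and averaging weight w0."""
    return np.zeros((X + 2 * delta, Y + 2 * delta)), 0.0

def train_node(w, w0, pattern, delta, alpha):
    """One training pass of a single node on a binary pattern.

    Assumed rule: dw = alpha * a * (1 - w) for the plastic weights,
    and dw0 = alpha * (1 - w0) for the averaging weight.
    """
    X, Y = pattern.shape
    inner = w[delta:delta + X, delta:delta + Y]   # view onto the unpadded region
    inner += alpha * pattern * (1.0 - inner)      # saturates at 1, never overshoots
    w0 += alpha * (1.0 - w0)                      # trained on every pass
    return w, w0

# Usage: with alpha = 1.0 a single pass stores the pattern exactly.
pattern = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
w, w0 = init_node(3, 3, delta=1)
w, w0 = train_node(w, w0, pattern, delta=1, alpha=1.0)
```

Because each update moves a weight only part of the way towards 1, repeated presentations can never push a weight past 1, which is the saturation behavior the text relies on for convergence.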

NEURAL  NETWORK  RECOGNITION 

In the recognition mode, the HAVNET attempts to classify an arbitrary input pattern into one of the classes
represented by the trained network nodes. During recognition, the neural network computes a modified version of
the directed Hausdorff distance between an input pattern and a stored pattern at a given node. To clarify the
explanation of this computation, the concept of a truncated Voronoi surface is introduced. A Voronoi surface is
constructed for a two-dimensional set of points A by first locating the members of A in the x-y plane, and then
plotting in the third (z) dimension the distance from any point in the x-y plane to the nearest point that is a member
of A [5]. When this distance is not allowed to exceed some value $\delta$, then the surface is defined as a truncated
Voronoi surface. The plot of the truncated Voronoi surface is sometimes referred to as an egg-carton plot because, if
the members of A form a rectangular grid, the resulting plot resembles an egg carton [3]. Figure 3 shows a set of 10
randomly located points, and Figure 4 shows the truncated Voronoi surface for that set, with the maximum distance at
$\delta = 3$. Note that cone-shaped depressions are formed in the surface at the locations of the original points.

Figure 3. Random Point Set

The Voronoi surface can be used to conveniently compute the directed Hausdorff distance between two point sets.
In order to perform the computation between a point set B and the previously defined point set A, the members of
B are simply located in the x-y plane, and the z-value above each point represents the distance from that point in
B to its nearest neighbor in A. The maximum of these values is the directed Hausdorff distance h(B,A). For
neural network purposes it is desirable to compute the inverse of this distance, so that shorter distances result in
higher outputs. It is also desirable to threshold this distance at some maximum, so that any distance above that
maximum will generate the minimum network output (zero in this case). For these reasons it is desirable for the
neural network to model the inverse of the truncated Voronoi surface. The truncation distance $\delta$ is the span of
the Voronoi layer of the neural network that was previously referred to.
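
The construction just described can be illustrated with a short NumPy sketch. It is not the network itself, only the geometric idea: build the truncated Voronoi surface for a point set A on an integer grid, then read the surface at the members of B and take the maximum. The function names and the integer-grid representation are assumptions made for this example. Because the surface is truncated, the value returned is the directed Hausdorff distance capped at the truncation distance.

```python
import numpy as np

def truncated_voronoi_surface(points, shape, delta):
    """z[x, y] = distance from grid cell (x, y) to the nearest member of A, capped at delta."""
    xs, ys = np.indices(shape)
    grid = np.stack([xs, ys], axis=-1).astype(float)          # shape (X, Y, 2)
    pts = np.asarray(points, dtype=float)                      # shape (K, 2)
    dists = np.linalg.norm(grid[:, :, None, :] - pts[None, None, :, :], axis=-1)
    return np.minimum(dists.min(axis=-1), delta)

def directed_hausdorff(B, surface):
    """h(B, A): the largest surface value found above the members of B."""
    return max(surface[x, y] for x, y in B)

# Usage: two points in A, two points in B, each one cell away from a member of A.
A = [(0, 0), (4, 4)]
surface = truncated_voronoi_surface(A, shape=(5, 5), delta=3.0)
h = directed_hausdorff([(0, 1), (4, 3)], surface)
```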

An example will serve to demonstrate the neural network processing before it is specified in detail. Assume a
network with a single node that is designed for 20 x 20 input patterns. Assume further that this node is trained in
the manner explained earlier, with a learning rate of $\alpha = 1.0$, to recognize the pattern of points previously
presented in Figure 3. After training, the response of the node is measured to input patterns that consist of a single
point in one position in the input matrix. That response is then plotted as a function of the position of that point in
the input field. Such a plot is shown in Figure 5. This plot demonstrates that, in a single training pass, the network
has learned to reproduce the exact inverse of the original truncated Voronoi surface of Figure 4. The network, then,
is prepared to compute the inverse directed Hausdorff distance between any input pattern and the learned pattern,
simply by projecting the points of the input pattern onto this learned surface. Although there are additional
complications in practice, this is the essence of how HAVNET conducts pattern recognition. The exact details of
the recognition process are given below.

Figure 4. Truncated Voronoi Surface

Figure 5. Network Node Response


The response of a node n to an input pattern $A^m$ is determined by first computing the output of the plastic layer:

$$p^n_{x,y,i,j} = w^n_{(x+i+\delta),(y+j+\delta)}\, a^m_{x,y} \qquad (10)$$

Where: x = 1...X  input x dimension
       y = 1...Y  input y dimension
       n = 1...N  node number
       i, j = Voronoi layer span
       $w^n$ = plastic layer weight for node n
       $p^n$ = plastic layer outputs for node n


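Given the outputs from the plastic layer, the outputs for the Voronoi layer $v^n$ are computed as follows:

$$v^n_{x,y} = \max_i \{ \max_j \{ v_{ij}\, p^n_{x,y,i,j} \} \} \qquad (11)$$

The Voronoi weights $v_{ij}$ are the same for all nodes and are computed as follows:

[expression for $v_{ij}$ illegible in the source], $-\delta \le i, j \le \delta$ (12)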


Once the outputs from the Voronoi layer are determined, the responses of the Hausdorff (or output) layer neurons
are computed:

$$net^n = \frac{1}{w^n_0\, \eta^n} \sum_{y=1}^{Y} \sum_{x=1}^{X} v^n_{x,y} \qquad (13)$$

Where: $w^n_0$ = averaging weight for node n
       $net^n$ = output for node n


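The normalizing quantity $\eta^n$ is determined as follows:

$$\eta^n = \max(\rho_a, \rho^n_w) \qquad (14)$$

Where:

$$\rho_a = \sum_{y=1}^{Y} \sum_{x=1}^{X} \phi\left(a^m_{x,y}\right) \qquad (15)$$

$$\rho^n_w = \sum_{y=1}^{Y} \sum_{x=1}^{X} \phi\left(w^n_{(x+\delta),(y+\delta)}\right) \qquad (16)$$

And the function $\phi$ is the following binary threshold function:

$$\phi(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (17)$$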


The final outputs $net^n$ indicate, for each node, the similarity of the input pattern to the patterns that have been
learned by that node.


CONCLUSIONS 

HAVNET represents a novel neural network that is specifically designed for pattern recognition. The network is
well developed and well behaved mathematically, and it is the first known neural network paradigm to take
advantage of the Hausdorff distance as a metric of similarity between two-dimensional patterns. In doing so, the
network inherits the desirable properties of the Hausdorff distance, and therefore duplicates human performance
more accurately than many previous neural network architectures. Furthermore, the network architecture is flexible
enough to incorporate self-organization, unsupervised learning, and nearest neighbor competitive or cooperative
learning in the plastic layer as required by specific applications. The evaluation of HAVNET on several pattern
recognition tasks is presently in progress. Initial results on character recognition, and the learning and recognition
of objects from edge images, are encouraging.

ABSTRACT

CORTECONS  (COntent-Retentive,  TEmporally-CONnected  neural  networks)  are  a  new  class 
of  neural  network.  CORTECONS  use  a  novel  energy  function,  a  "free  energy"  taken  from 
statistical  mechanics  models  of  Ising  spin  glasses,  to  facilitate  a  richer  range  of  temporal 
behaviors  than  currently  are  available.  This  "free  energy"  includes  a  spatial  configuration 
entropy  term  in  addition  to  the  basic  interaction  energy  that  is  commonly  used.  By  making 
the  interaction  energy  dependent  on  nearer-neighbor  interactions  only,  and  by  using  the 
spatial  configuration  entropy,  minimizing  the  network's  "free  energy"  drives  the  network 
towards  certain  types  of  patterns  rather  than  to  specific,  stored  patterns.  The  specific 
patterns  to  which  the  network  moves  can  thus  be  influenced  by  a  number  of  factors  that  allow 
the  influence  of  previous  system  states.  These  include  current  input,  regular  and  temporally- 
gated  lateral  interconnections,  unit  activation  from  previous  inputs,  and  other  factors.  With 
different  implementation  strategies,  CORTECONS  form  a  class  of  neural  networks  that 
provide  a  basis  for  a  rich  range  of  temporal  pattern  association  capabilities. 


1.  DECOUPLING  THE  DRIVING  FORCES;  A  NEW  APPROACH  IN  NEURAL 
SYSTEMS  DESIGN 

One  of  the  greatest  challenges  in  neural  network  design  is  to  create  a  class  of  neural  networks 
with  richer  temporal  processing  and  association  properties  than  are  currently  existent.  The 
temporal  properties  of  a  network  are  related  to  two  major  factors.  One  is  the  extent  to  which 
either  the  individual  neurons  and/or  the  interconnections  can  maintain  some  temporal 
continuity,  or  have  memory  of  previous  states.  The  other  is  the  nature  of  the  dynamics  which 
govern network processes. When a network's dynamics are associated with the minimization
of an energy function, as in the case of the Hopfield neural network, we can think of the
energy  minimization  as  a  "driving  force"  governing  the  network  processes.  We  can  expand 
our  concept  of  a  "driving  force"  to  include  the  structurally-dependent  process  of  transferring 
weighted  signals  between  neurons.  Such  a  structurally-embedded  "driving  force"  is  the  rule 
in  feedforward  networks.  Current  driving  forces  are  state  specific.  That  is,  they  drive  the 
network  towards  one  of  a  set  of  known,  encoded  states. 

This paper introduces the novel approach of decoupling the driving forces in a neural network
into two forces: one which is state-specific, and another which is non-state-specific. The
state-specific driving force is encoded, as usual, within the connection weights of the network.
The non-state-specific driving force is a free energy minimization process which drives the
network towards states with certain configuration characteristics, rather than to specific
instantiations of such states. The interaction between the two forces produces a class of
network which not only has good pattern response capabilities, but which can also exhibit a
wide range of temporal properties which have not hitherto been found in any neural network.
Most significantly, this architecture promises a route to more cortical-like behavior of the
artificial neuron assemblage. This leads to the name for the prototype of this new class of
neural network, the CORTECON: A COntent-Retentive, TEmporally-CONnected network.

The  use  of  a  free  energy  function  as  a  driving  force  instead  of  the  usual  energy  function  is 
novel  in  neural  network  design.  The  free  energy  function  contains  an  entropy  term,  which 
combines  additively  with  the  energy  term  to  create  the  free  energy.  The  "free  energy" 
concept  has  been  well  established  in  thermodynamics  and  statistical  thermodynamics.  Within 
the  thermodynamics  framework,  a  system  approaches  equilibrium  by  minimizing  its  free 
energy,  rather  than  just  minimizing  the  simple  energy  function.  The  key  to  minimizing  the 
free  energy  of  the  system  is  to  treat  the  system  as  an  ensemble  of  bistate  units,  to  which  the 
principles  of  statistical  thermodynamics  can  be  applied.  The  next  section  describes  the 
structure  and  dynamics  of  a  basic  CORTECON.  The  following  section  gives  some  particulars 
of  the  special  type  of  free  energy  which  is  used  by  the  CORTECON.  A  final  section  presents 
some  results  to  date. 


2.  CORTECONS:  NEURAL  NETWORKS  DRIVEN  BY  FREE  ENERGY 
MINIMIZATION 

CORTECONs  are  a  class  of  novel  neural  networks  that  have  as  a  common  element  two 
features:  processing  units  or  "neurons"  whose  activations  are  dependent  on  previous  states, 
and  the  use  of  "free  energy"  as  a  driving  function.  The  basic  CORTECON  structure  is  a  two- 
layer  neural  network  consisting  of  an  input  layer  and  a  computational  layer,  as  illustrated  in 
Figure  1.  The  input  layer  functions  in  the  usual  manner  of  accepting  inputs  and  propagating 
them  via  weighted  sums  to  the  computational  layer.  The  computational  layer  is  composed  of 
processing  units  which  receive  inputs  and  Gaussian  noise  and  which  also  experience 
activation  decay.  The  state  of  each  processing  unit  is  governed  by  a  function  of  both  its 
activation  (due  to  the  previously  mentioned  factors)  and  the  overall  drive  to  minimize  the  free 
energy. The process of minimizing the free energy can alter a unit's state. For this to happen, the
absolute value of the unit's activation must be less than a predetermined threshold. Units
above  threshold  value  are  "fixed"  on  or  off,  and  stay  that  way  until  changes  in  input  or 
activation  decay  cause  the  activation  to  become  smaller  than  the  (positive  or  negative) 
threshold  value.  Once  a  unit's  activation  is  smaller  than  threshold,  it  becomes  labile,  and  the 
free  energy  minimization  process  can  change  that  unit's  state. 

The free energy minimization process is conducted in a manner similar to training a
Boltzmann machine. Units in the computational layer are selected at random. If a unit has
activation smaller than threshold, its state is changed, and the free energy for the
computational (output) layer is redetermined. If the change results in a lower total free
energy, the change is kept. Otherwise, the unit is returned to its previous state. This means
that even if a unit is receiving positive activation, and would typically be in an "on" state,
the free energy minimization process can turn it off, and vice versa.
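
The flip-and-test procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, states are coded as 0/1, and the free energy is passed in as a callable. The stand-in `like_neighbor_energy` used in the usage lines is only an interaction-energy toy (it rewards adjacent units in the same state) and omits the entropy terms of the CORTECON free energy.

```python
import random

def minimize_free_energy(states, activations, threshold, free_energy, sweeps=200, seed=0):
    """Randomly pick units; flip labile ones and keep the flip only if free energy drops."""
    rng = random.Random(seed)
    states = list(states)
    current = free_energy(states)
    for _ in range(sweeps):
        i = rng.randrange(len(states))
        if abs(activations[i]) >= threshold:
            continue                      # unit is "fixed" on or off; leave it alone
        states[i] = 1 - states[i]         # trial flip of a labile unit
        trial = free_energy(states)
        if trial < current:
            current = trial               # lower free energy: keep the change
        else:
            states[i] = 1 - states[i]     # otherwise restore the previous state
    return states, current

def like_neighbor_energy(states):
    """Stand-in energy (assumption): -1 for every adjacent pair of units in the same state."""
    return -sum(1 for a, b in zip(states, states[1:]) if a == b)

# Usage: units 1 and 4 are below threshold (labile); the others are fixed.
final, energy = minimize_free_energy(
    [1, 0, 1, 1, 0, 1], [2.0, 0.1, 2.0, 2.0, 0.1, 2.0], 1.0, like_neighbor_energy)
```

Accepting only flips that lower the free energy makes this a greedy descent; the accepted free energy can never increase from one step to the next.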

Use  of  free  energy  minimization  for  this  network  is  analogous  to  using  a  Lyapunov  function. 
The  free  energy  is  not  a  time-dependent  function,  nor  is  it  a  potential  energy  function  in  the 
sense  used  for  most  neural  network  Lyapunov  functions.  However,  it  is  used  in  analogy  to 
the  free  energy  minimization  process  observed  in  many  natural  systems.  Use  of  a  free  energy 
function  of  the  type  described  here  implies  that  the  network  exists  at  or  near  an  equilibrium 
state,  and  that  inputs  to  individual  processing  units  in  the  network  are  treated  as  perturbations 
on  the  overall  network  state.  When  a  perturbing  input  is  received  by  the  network,  the 
network adapts its overall spatial configuration so that it accommodates both to the inputs to
each processing unit and to the overall free energy minimization.


3.  HELMHOLTZ FREE ENERGY WITH SPATIAL CONFIGURATION
ENTROPY

The "computational layer" of this new class of neural network can be modeled as large 1-D or 2-D
"grids" of bistate processing units. (A 1-D grid has been used for prototyping the CORTECON.)
This grid can be treated as an ensemble of bistate units, and the Ising statistical mechanics model
can be applied. The basic formalism for the Helmholtz Free Energy (named after H. von Helmholtz,
who did much of the pioneering work in this area) is

$$A = E - TS, \qquad (1)$$

where A is the Helmholtz Free Energy, E is the energy, T is the temperature, and S is the
entropy. We can express (1) in reduced form, by dividing through by temperature,
Boltzmann's constant (k), and the total number of units in the system (N). (Both the latter
terms are involved in the expression for entropy.) This yields

$$\bar{A} = \bar{E} - S/(Nk), \qquad (2)$$

where $\bar{A}$ and $\bar{E}$ are the reduced Helmholtz free energy and the reduced energy, respectively.
The equilibrium state of a system (temperature and volume fixed) is defined as the minimum in
the Helmholtz free energy. Two processes contribute to this free energy minimization:
minimizing the total energy of the system (defined in terms of the energies of individual units
and their interactions), and maximizing the entropy of the system. The entropy of a system
describes the distribution of its components into the different possible states. Usually the
states which are considered for entropy are the energy states of individual units. An
alternative is to consider as different "states" the variety of local spatial configurations of
processing units in different states. This can be used to construct an entropy function which
drives the system towards a spatial configuration characterized by a distribution of certain
types of local patterns. These micropatterns are composed of nearest-neighbors and
next-nearest-neighbors, which provide respectively three and six distinct types of configurations, as
is shown in Figure 2, for configurations composed of units in one of two states, A or B.

Figure 1: CORTECON Engine.

A "configuration" in terms of the cluster variation theory, on which this work is based, can be
either a "pairwise" configuration, e.g. the pair A-B or the pair B-A (both would be counted as
the same type of configuration), or a "triple," e.g., A-B-A. Configurations which
appear different when viewed right-to-left vs. left-to-right are treated as different
instantiations of the same configuration, but are weighted doubly with the redundancy
parameters $\beta_i$ and $\gamma_i$, as is shown in Figure 2.
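
To make the counting of configuration types concrete, the sketch below measures the fractions of pairwise and triple configurations in a 1-D grid, treating a configuration and its mirror image as the same type. The representation of the grid as a string of "A" and "B" characters and the function names are assumptions for this example; which fraction corresponds to which numbered cluster variable is defined in Figure 2 and is not reproduced here.

```python
from collections import Counter

def canonical(cfg):
    """Map a configuration and its left-right mirror image to one representative."""
    return min(cfg, cfg[::-1])

def configuration_fractions(states, size):
    """Fraction of each configuration type among all windows of the given size."""
    windows = [canonical(states[i:i + size]) for i in range(len(states) - size + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {cfg: n / total for cfg, n in counts.items()}

# Usage: pairs (size 2) and triples (size 3) of a short 1-D grid.
pairs = configuration_fractions("AABBA", 2)    # types AA, AB (= BA), BB
triples = configuration_fractions("AABBA", 3)  # e.g. AAB (= BAA), ABB (= BBA)
```

Enumerating all two-state strings confirms the counts quoted in the text: merging mirror images leaves three distinct pair types and six distinct triple types.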

The specific Helmholtz free energy equation which is used as a driving function for this class
of neural network is given as [Maren et al., 1992; Kikuchi & Brush, 1967]

$$\bar{A} = \frac{A}{NkT} = 2\beta\varepsilon(-z_1 + z_3 + z_4 - z_6) - \sum_{i=1}^{3} \beta_i\, \mathrm{Lf}(y_i) + \sum_{i=1}^{6} \gamma_i\, \mathrm{Lf}(z_i) + \mu\beta\Bigl(1 - \sum_{i=1}^{6} \gamma_i z_i\Bigr) + 4\lambda(z_3 + z_5 - z_2 - z_4) \qquad (3)$$

where

$\varepsilon$ is the interaction energy between processing units,

$\beta$ is Boltzmann's constant,

$y_i$ and $z_i$ are the cluster variables, as illustrated in Figure 2,

$\beta_i$ and $\gamma_i$ are cluster variable coefficients that account for redundancy in the way a
given cluster variable can be measured, and

$\mu$ and $\lambda$ are Lagrange coefficients.

The term Lf(x) is given as

$$\mathrm{Lf}(x) = x \ln(x) - x. \qquad (4)$$

The first term on the RHS of Eq. (3) is the interaction energy, which is negative when
neighboring units are in the same state. The next two terms are the entropy, and express the
distribution of the ensemble into different types of local spatial configurations. The variables
used to describe these relationships are the nearest-neighbor configuration variables, $y_i$, and
the "triples," $z_i$, both shown in Fig. 2. The remaining two terms are Lagrangian
multiplier terms.


4.  RESULTS  OF  NETWORK 
OPERATIONS 


Early  studies  with  the  CORTECON 
(Maren  et  al.,  1992;  Maren,  1993) 
confirmed  that  the  free  energy 
minimization  process,  as  carried  out  via 
the  stepwise  process  of  "flipping"  unit 
states  and  testing  the  new  free  energy, 
produced  results  which  met  theoretical 
predictions  for  the  free  energy  of  the 
computational  layer.  Recent  work  has 
focused  on  identifying  the  pattern 
association  abilities  of  the  network,  and  on 
introducing  design  features  which  give  the 
network  unique  and  interesting  temporal 
capabilities.  These  design  features 
include  additive  noise  in  the  computational 
layer  units,  exponential  activation  decay  of 
patterns  once  input  stimulus  is  removed, 
and use of interneurons to strengthen the
activation  of  units  in  response  to  a  present 
pattern,  or  to  enhance  temporal  association 
with  a  succeeding  pattern. 

Our pattern association studies have confirmed that once the input-to-computational layer
connection weights have been briefly trained using a variation of Hebbian learning, it is possible
to gain recognizable recall of "prototype" output patterns associated with a given input
pattern. Prototype output patterns were identified for each of the randomly established, stored
input patterns used in training and testing the network. They were obtained by randomly
presenting the different input patterns at least 50 times after network training. The resultant
output patterns for each distinct input were stored and averaged. For testing, the inputs were
again presented a similar number of times, and the normalized Hamming distance between the
resultant output pattern and the corresponding prototype was found. Hamming distances between
each of the prototypes (in pairwise manner) were also found. We found that the Hamming distance
between the prototypes was typically 3-4 times as large as the Hamming distance between a
resultant pattern and its associated prototype. This gives confidence that the unit clustering
caused by action of the free energy driving force does not too greatly distort the
heteroassociative capabilities of this network.

Figure 2: The Fraction Variables from Cluster Variation Theory.
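
The prototype comparison described above can be sketched as below. The paper states only that the output patterns were "stored and averaged"; rounding the per-unit average back to a binary pattern is an assumption made here so that a Hamming distance can be taken, and both function names are hypothetical.

```python
def prototype(outputs):
    """Average repeated binary output patterns unit by unit, then round to 0/1 (assumed)."""
    n = len(outputs)
    return [1 if sum(column) / n >= 0.5 else 0 for column in zip(*outputs)]

def normalized_hamming(a, b):
    """Fraction of units in which two equal-length binary patterns differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Usage: build a prototype from three noisy recalls, then compare a new recall to it.
proto = prototype([[1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 1]])
distance = normalized_hamming(proto, [1, 1, 1, 0])
```

The test in the text then amounts to checking that distances between a recall and its own prototype are several times smaller than distances between different prototypes.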

The combination of free energy (which causes clustering of like units) with learned and
sparse lateral connections, or interneurons, becomes valuable in maintaining output pattern
stability during the activation decay which follows when the input pattern is removed. When
an input pattern is presented, clusters of like units will develop in the output layer as a result
of the free energy minimization. The "core" units of such clusters typically have high
activation values, and so are impervious to the random "flipping" of units with less activation.
When the input pattern is removed and activation decay begins, the interactions between like
nearest neighbors in the clusters help to "persist" the cluster for a longer time than would be
so if the free energy minimization process were not present. Further, the interneurons
established between the strongest units (whether "on" or "off") are designed to persist the
state of the receiving units. This helps them maintain their original state. Interneurons have
also been designed to facilitate association and stabilization of temporally-paired patterns.

Deformed lattice analysing using neural networks

Optical methods of measuring in-plane deformations and displacements on the
surface of specimens and workpieces are used in many branches of the
metalworking industry, development, investigation, and science. Examples are
the method of visioplasticity, interferometric and moire techniques,
etc. All these methods are based on the optical analysis of the resulting grid structures
and the measurement of their points or lines of interest. In particular, the method of
visioplasticity is often applied in the analysis of large plastic deformations in
metal cutting and forming processes and for quality control. A typical example
of a visioplastic application (metal cutting analysis) is shown in Fig. 1
below.

The image analysis of deformed grids may help in the development and
investigation of pairs of workpiece and cutting material, optimized
technological conditions, and new tool materials which may allow cutting
without coolants and lubricants or with reduced use of them.
However, for a comprehensive and economical industrial use of these methods
it is necessary to automate the recognition of grid points and the analysis of
deformations. For the processing of given points there exist some programs and
software systems such as the system VISIO, developed by the Society for
Production Engineering and Development (GFE) Chemnitz, but the analysis of
large or extremely deformed grid structures continues to be practicable only
manually (using microscopes, digitizers, etc.). This paper proposes
a solution for recognizing points in strongly deformed and damaged lattices using
digital image processing in combination with neural networks.

For the recognition of small or medium deformed lattice structures there exists a
collection of useful and exact algorithms. These are, for example, binary
image processing, various filters, splines, and the Fourier transform. However, if
the deformation of the surface is large or extreme, the results of these "classical"
algorithms are inexact or false, because most methods require a similar form
of grid crosses. In addition, as a consequence of the large deformation the
grids are often damaged. During the last three years, we have investigated the
behaviour of traditional algorithms for lattice analysis, such as
skeletonization of binary images and the Fourier transform, in visioplastic
experiments (Fig. 1) with a deformation of some hundred percent. Usually the
results of these methods are false, and not one of these algorithms has produced
evaluable results.


Fig. 1: Typical example of a strongly deformed lattice in visioplastic investigations
(metal cutting analysis).

In this case, it is necessary to find a way to search for and classify the objects
automatically, including the learning of new search patterns (auto-adaptation). A
promising way is the use of biocomputing procedures such as fuzzy logic and
neural networks. In particular, some models of neural networks appear
suited to problems of pattern recognition, and preliminary experience
confirms the suitability of these models for the cross analysis in deformed lattices.

In the first investigations the backpropagation model with a sigmoid
activation function was chosen. This model is often applied in the optical recognition of
patterns and characters (OCR). A backpropagation network usually contains
three or more neuron layers. The output of each neuron of one layer is
connected to an input of each neuron of the next layer; all of these connections
are weighted. A source pattern is applied to the inputs of the neurons of the
first layer. The network propagates the input pattern layer by layer. Finally,
the target pattern appears at the outputs of the neurons of the output layer. A
backpropagation network needs a "supervisor" during the learning phase, i.e.
the target pattern must be known. For each pair of patterns the
network then adjusts the weights until the difference between the target pattern and
the output pattern is minimal.

manually. Learning cycles were performed until the average error per pair of
patterns was less than 0.1. Usually a backpropagation network needs no more than
some hundred cycles.
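
The training procedure can be illustrated with a small NumPy sketch of a three-layer sigmoid network trained by backpropagation until the average error per pattern pair falls below a tolerance. Everything specific here is an assumption for illustration: the error measure (mean over patterns of the summed squared output error), the layer sizes, the learning rate, the weight initialization, and the function names. The paper does not specify these details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(inputs, targets, hidden=4, rate=0.5, tol=0.1, max_cycles=5000, seed=0):
    """Full-batch backpropagation; stops once the average error per pattern is below tol."""
    rng = np.random.default_rng(seed)
    w1 = rng.uniform(-0.5, 0.5, (inputs.shape[1] + 1, hidden))    # last row holds biases
    w2 = rng.uniform(-0.5, 0.5, (hidden + 1, targets.shape[1]))
    ones = np.ones((inputs.shape[0], 1))
    x = np.hstack([inputs, ones])
    errors = []
    for _ in range(max_cycles):
        h = np.hstack([sigmoid(x @ w1), ones])                    # hidden layer plus bias unit
        out = sigmoid(h @ w2)
        diff = out - targets
        errors.append(float(np.mean(np.sum(diff ** 2, axis=1))))
        if errors[-1] < tol:
            break
        delta_out = diff * out * (1.0 - out)                      # output-layer error signal
        delta_hid = (delta_out @ w2[:-1].T) * h[:, :-1] * (1.0 - h[:, :-1])
        w2 -= rate * (h.T @ delta_out)
        w1 -= rate * (x.T @ delta_hid)
    return w1, w2, errors

def recall(w1, w2, inputs):
    """Propagate input patterns through the trained network; outputs lie between 0 and 1."""
    ones = np.ones((inputs.shape[0], 1))
    h = np.hstack([sigmoid(np.hstack([inputs, ones]) @ w1), ones])
    return sigmoid(h @ w2)

# Usage on a toy task (logical OR), standing in for the image patches of the paper.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [1]], dtype=float)
w1, w2, errors = train_backprop(X, T)
```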

During the "recall" phase, the part of the image evaluated by the network is
moved pixel by pixel until the complete image is scanned. For each of these
sub-images the backpropagation network gives an output value in the range between 0
and 1. This value can be interpreted as a "level of comparability" between the
learned patterns and the actual part of the lattice. All output values are stored
in a file on a hard disk and form a matrix.

This matrix corresponds to the filtered image and contains, for each picture element,
the probability of the existence of an object. Like an image, the
matrix can then be processed using traditional image processing methods. For
example, if the location of the crosses is wanted, the local maxima greater
than a threshold (ca. 0.5 to 0.7) can be interpreted as probable crosses. Algorithms
for this problem also exist in classical image analysis.
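
The recall scan and the subsequent peak search can be sketched as follows. The scoring function is passed in as a callable and stands in for the trained network; the window mean used in the usage lines is only a placeholder. The 8-neighbor definition of a local maximum and the function names are assumptions for this example.

```python
import numpy as np

def scan_image(image, window, score):
    """Move a window x window sub-image pixel by pixel and store one score per position."""
    rows = image.shape[0] - window + 1
    cols = image.shape[1] - window + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            out[r, c] = score(image[r:r + window, c:c + window])
    return out

def local_maxima(matrix, threshold):
    """Positions whose value exceeds the threshold and is not exceeded by any 8-neighbor."""
    peaks = []
    rows, cols = matrix.shape
    for r in range(rows):
        for c in range(cols):
            value = matrix[r, c]
            if value <= threshold:
                continue
            patch = matrix[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if value >= patch.max():
                peaks.append((r, c))
    return peaks

# Usage: a 3x3 bright block in a 5x5 image, scored by the window mean (placeholder scorer).
image = np.zeros((5, 5))
image[1:4, 1:4] = 1.0
scores = scan_image(image, window=3, score=lambda patch: float(patch.mean()))
crosses = local_maxima(scores, threshold=0.6)
```

Note that a flat plateau of equal scores would report every plateau cell as a peak; a practical implementation would merge such neighbors into one detection.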


Fig. 2: The matrix of output values, shown as an image.

Fig. 3: Local maxima of the output values in the original image.

An effective simulation of neural networks needs very powerful hardware
and a large memory. Usually, for networks of the described size (some 1000 up
to 10,000 connections) special hardware features (emulators) are used.
However, in these investigations special hardware was not used. Consequently,
some essential restrictions were necessary. It is desirable to connect each picture
element (pixel) in the sub-image evaluated per step to an input neuron. The
largest networks that can be created in the main memory of an MS-DOS PC contain
ca. 13x13 input neurons. However, the recognition of a complete cross in the
applied lattices needs a sub-image size of ca. 25x25 pixels. Therefore,
every 2x2 adjacent pixels must be combined, and the network receives only the
average value of these four pixels. As a result, the quality of thin
lines in the image in particular is degraded.
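
The 2x2 averaging step can be sketched with NumPy as below. How the authors handled the odd 25-pixel edge is not stated; padding by repeating the last row and column, which turns a 25x25 sub-image into 13x13 network inputs, is an assumption made for this example, as is the function name.

```python
import numpy as np

def average_2x2(sub_image):
    """Replace every block of 2x2 adjacent pixels by its average value."""
    rows, cols = sub_image.shape
    # Assumed edge handling: repeat the last row/column when a dimension is odd.
    padded = np.pad(sub_image, ((0, rows % 2), (0, cols % 2)), mode="edge")
    r2, c2 = padded.shape[0] // 2, padded.shape[1] // 2
    return padded.reshape(r2, 2, c2, 2).mean(axis=(1, 3))

# Usage: a 25x25 sub-image becomes a 13x13 input field.
inputs = average_2x2(np.zeros((25, 25)))
```

A line one pixel wide contributes at most half of any 2x2 block it crosses, which is why thin lines lose contrast under this reduction.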

Nevertheless, most results of the investigated networks are remarkable. In
the parts with small lattice deformations the backpropagation network recognized all
cross points correctly and unambiguously, with a deviation of the
cross localisation of at most one pixel. In the strongly deformed parts (Fig. 3/4)
at least 90% of all the cross points were detected, even though the grids have
wide damaged areas. These results are better and more exact than the results of
most traditional image analysis methods.

AUTOMATIC TARGET RECOGNITION FROM
RADAR RANGE PROFILES USING FUZZY ARTMAP


Mark A. Rubin

Physics Division, Naval Air Warfare Center

China  Lake,  California  93555 

Abstract 

We investigate the use of the Fuzzy ARTMAP neural network for automatic classification
of  targets  based  on  their  radar  range  profiles.  Tests  on  synthetic  data  indicate  that  arbitrarily 
high  accuracy  may  be  achieved  by  increasing  sufficiently  the  number  of  aspects  used  during 
training. Creation of “artificial training sets” by interpolation of input data is examined as a
potential  means  of  decreasing  the  number  of  training  aspects  required  to  achieve  a  given  level 
of  accuracy,  and  is  shown  to  be  of  limited  effectiveness.  The  problem  of  rejecting  patterns 
not  present  in  the  training  set  is  also  examined. 


1  Radar  range  profiles 

A  “range  profile”  of  an  object  is  a  sort  of  “one-dimensional  picture”  of  the  object,  generated  from 
its radar reflection [1, 2, 3]. Imagine, for example, a sharp pulse of electromagnetic energy being
directed at an aircraft which is flying directly towards the place where the transmitter of the pulse
is  located.  A  receiver,  located  at  approximately  the  same  spot  as  the  transmitter,  will  not  pick 
up  the  same  sharp  pulse  which  was  originally  sent  out.  Radiation  reflected  from  the  nose  of  the 
aircraft,  having  less  far  to  travel,  will  arrive  back  at  the  receiver  first,  followed  by  reflections  off 
of  parts  of  the  interior  of  the  aircraft  and  the  wings,  followed  finally  by  reflections  from  the  tail. 
If the strength of the reflections is recorded in “bins” corresponding to time of arrival (i.e., the
initial reflection from the nose, and anything following within a time Δt, is summed and recorded
in bin 1, everything arriving between Δt and 2Δt goes in bin 2, etc.), the resulting histogram of
reflected energy forms a pattern which is termed a “radar range profile” or “downrange profile.”
The bins are referred to as “range bins,” since reflections which arrive at the receiver within a time
Δt must have come from parts of the target whose respective distances from the receiver differ by
no more than cΔt/2.
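
As a small worked example of the relation between bin duration and range extent, the sketch below computes cΔt/2 and the bin into which a delayed reflection falls. The function names and the 4 ns bin duration are illustrative assumptions, and bins are numbered from 0 here rather than from 1 as in the text.

```python
C = 299_792_458.0  # speed of light in m/s

def bin_depth(delta_t):
    """Range extent (m) covered by one bin of duration delta_t (s): c * delta_t / 2."""
    return C * delta_t / 2.0

def bin_index(arrival_delay, delta_t):
    """0-based bin for a reflection arriving arrival_delay seconds after the first return."""
    return int(arrival_delay // delta_t)

delta_t = 4e-9                      # hypothetical 4 ns bin
depth = bin_depth(delta_t)          # roughly 0.6 m per bin
```

The factor of 1/2 appears because the pulse travels to the scatterer and back, so a delay of Δt corresponds to a one-way distance of cΔt/2.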

Clearly,  different  aircraft  (or  other  targets)  will  yield  different  profiles;  so  it  is  natural  to 
consider  using  these  profiles  to  identify  the  target.  Just  as  clearly,  the  profile  of  a  given  target 
will  change  as  a  function  of  the  relative  orientation  of  the  target  with  respect  to  the  line  from 
the  transmitter/receiver  to  the  target.  (The  overall  amplitude  of  the  profile  will  also  change 
as the distance to the target changes, but this can always be compensated for by applying a
normalization procedure to all profiles.) In this regard, range profiles are no different from the
usual two-dimensional images of objects formed by optical lenses; e.g., the appearance of the
computer monitor before which I am sitting changes somewhat if I move my head to the left or
right  a  couple  of  inches,  and  changes  drastically  if  I  view  it  from  above  or  the  side. 

The  distinguishing  feature  of  range  profiles  is  that  even  small  changes  in  viewing  angle  tend 
to  result  in  large  changes  in  “appearance.”  Since  the  object  being  “viewed”  is  extended  in  the 
directions transverse to the direction of the radar beam, and since the radar is coherent, reflections
which  fall  into  the  same  bin  will  exhibit  interference.  This  coherent  interference  is  analogous  to 
the  speckle  observed  when  viewing  an  object  under  laser  light,  but  is  of  much  greater  magnitude, 
due  to  the  differences  in  the  wavelengths  of  the  radiation  and  the  apertures  involved.  As  a  result, 
small  changes  in  viewing  direction  can  produce  large  changes  in  the  constructive  or  destructive 
nature  of  the  interference  in  the  bins,  and  hence  in  the  shape  of  the  profile.  Simulated  range 
profiles  of  the  same  target  viewed  from  directions  only  one  degree  apart  are  shown  in  Fig.  1. 
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Figure  1:  Synthetic  range  profiles  of  the  first  target  in  Fig.  2,  at  aspects  of  0  (left)  and  1  (right) 
degrees,  after  normalization. 

2  Fuzzy  ARTMAP 

Since a given target will generally yield very different profiles when viewed from directions differing
even slightly, a range-profile classifier must be able to group together input patterns with widely
varying characteristics and assign them to a common classification category. At the same time, the
large number of different patterns per category requires that the information needed to accomplish
the task of classification be recorded in as efficient a manner as possible. These requirements
suggest the application of the Fuzzy ARTMAP neural network[4] to this problem.

The core of Fuzzy ARTMAP is a Fuzzy ART network[5] which performs unsupervised
classification of analog vectors with components between 0 and 1. Similar input vectors are associated
with the same recognition node, “similarity” being determined by an L1-like norm. During
training, these recognition nodes* become associated with nodes representing the categories into which
the vectors are to be classified (e.g., type of aircraft). If a vector is presented which matches a
recognition node which has, through previous training, become associated with a particular
category, the vector is considered to have been classified in that category. During the training phase
an incorrect category choice by the network causes the network to repeatedly reassign the input
vector to different recognition nodes or, if necessary, create an entirely new recognition node. In
this way the network learns to make correct predictions for the vectors on which it is trained,
while creating only the minimum number of recognition nodes necessary for the task. Storage
requirements are thus minimized, and generalization ability is maximized.
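
To make the unsupervised core concrete, here is a minimal sketch of Fuzzy ART with complement coding and fast learning, following the standard choice and match rules. It covers only the unsupervised module, not the supervised map field or match tracking of full Fuzzy ARTMAP. The class name `FuzzyART` and the vigilance of 0.8 in the usage lines are illustrative assumptions, not settings from this paper.

```python
import numpy as np

def complement_code(a):
    """Map a in [0, 1]^M to the 2M-vector (a, 1 - a)."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

class FuzzyART:
    def __init__(self, vigilance=0.0, alpha=1e-4):
        self.rho = vigilance      # vigilance parameter
        self.alpha = alpha        # choice parameter
        self.w = []               # one weight vector per committed recognition node

    def train(self, a):
        """Present one input; return the index of the node that learns it."""
        i = complement_code(a)
        # Choice function T_j = |I ^ w_j| / (alpha + |w_j|); try the largest first.
        scores = [np.minimum(i, w).sum() / (self.alpha + w.sum()) for w in self.w]
        for j in np.argsort(scores)[::-1]:
            match = np.minimum(i, self.w[j]).sum() / i.sum()
            if match >= self.rho:                       # vigilance test passed
                self.w[j] = np.minimum(i, self.w[j])    # fast learning
                return int(j)
        self.w.append(i.copy())                         # commit a new node
        return len(self.w) - 1

net = FuzzyART(vigilance=0.8)
first = net.train([0.10, 0.20])
second = net.train([0.12, 0.20])    # close to the first input, so it reuses node 0
third = net.train([0.90, 0.90])     # fails the vigilance test, so a new node is committed
```

Here `^` denotes the component-wise minimum and `|.|` the sum of components, which is the L1-like norm referred to in the text.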

3  Data  simulation  and  classification 

We  consider  only  two-dimensional  targets,  so  the  relation  between  the  direction  of  the  radar  beam 
and  the  orientation  of  the  target  is  given  by  a  single  aspect  angle,  which  will  be  taken  to  be  zero 
when  the  aircraft  is  heading  directly  towards  the  source  of  the  beam.  To  generate  simulated  range 
profiles,  the  return  signal  will  be  calculated  as  being  produced  by  100  point  scatterers  located 
randomly  within  the  target,  and  the  far-field  approximation  will  be  used.  Where  not  otherwise 
specified,  simulations  will  be  done  using  the  nine  “aircraft”  shown  in  Fig.  2.  The  length  of  each 
target  is  about  10  meters,  and  the  wavelength  of  the  radar  is  two  centimeters.  Thirty  range  bins 
are used, each covering 2/3 of a meter, so the entire range profile covers 20 meters with returns

*Lagrer  in  [4]. 
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Figure  2:  Scattering  centers  for  set  of  nine  targets  (three  wing  positions,  three  wing  lengths). 


from  the  targets  filling  15  to  20  of  the  bins.  The  profiles  are  normalized  by  dividing  the  value  of 
each bin of the “raw” profile by the sum of all the bins.
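
The simulation just described can be sketched roughly as below. The paper does not spell out its formulas, so several choices here are assumptions: unit-amplitude scatterers, a two-way phase of 4πr/λ, the magnitude of the coherent sum as the bin value, and a hypothetical box-shaped target. The function name `range_profile` is likewise illustrative.

```python
import numpy as np

def range_profile(scatterers, aspect_deg, wavelength=0.02,
                  n_bins=30, bin_size=2.0 / 3.0):
    """Normalized synthetic range profile of 2-D point scatterers (far field)."""
    theta = np.deg2rad(aspect_deg)
    # Downrange coordinate of each scatterer along the line of sight.
    r = scatterers[:, 0] * np.cos(theta) + scatterers[:, 1] * np.sin(theta)
    # Two-way path difference 2r gives a phase of 4*pi*r/lambda (unit amplitudes assumed).
    field = np.exp(1j * 4.0 * np.pi * r / wavelength)
    # Centre the target in the range window and assign each scatterer to a bin.
    idx = np.floor(r / bin_size + n_bins / 2.0).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)
    summed = np.zeros(n_bins, dtype=complex)
    np.add.at(summed, idx, field)              # coherent sum within each bin
    raw = np.abs(summed)
    return raw / raw.sum()                     # divide each bin by the sum of all bins

rng = np.random.default_rng(0)
# Hypothetical target: 100 scatterers in a 10 m x 1 m box.
target = np.column_stack([rng.uniform(-5, 5, 100), rng.uniform(-0.5, 0.5, 100)])
p0 = range_profile(target, 0.0)
p1 = range_profile(target, 1.0)
```

Because the wavelength is much smaller than the transverse extent of the target, a one-degree change in aspect shifts the relative phases within a bin by several radians, so `p0` and `p1` differ markedly.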

The baseline vigilance parameter, which governs the fineness with which the initial
unsupervised classification (association with recognition nodes) is made, is kept at its minimum value of
zero,* as this seems to give the lowest error rate when the network is asked to classify targets of a
type which it has been trained to recognize. (See, however, Sec. 7 below.) The voting procedure
is also used, in which a set of Fuzzy ARTMAP networks, each of which has been trained with a
different random permutation of the training data, vote on the classification of a test input.
Unanimity is required for the “voters” to be considered to have made a choice. Comparatively little
improvement was found in having an “electorate” of more than two members (again, however, see
Sec. 7), and this is the number used unless otherwise stated.
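
The unanimity rule, and the two kinds of rate reported in the tables, can be sketched as follows. The function names `poll` and `rates` are illustrative, and `None` is used here as a stand-in for an abstention.

```python
def poll(votes):
    """Return the common prediction if all voters agree, else None (abstain).

    A vote of None means that voter itself abstained.
    """
    first = votes[0]
    if first is None or any(v != first for v in votes):
        return None
    return first

def rates(predictions, truths):
    """Percent abstentions (of all tests) and percent errors (of predictions made)."""
    made = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    abst = 100.0 * (len(predictions) - len(made)) / len(predictions)
    err = 100.0 * sum(p != t for p, t in made) / len(made) if made else 0.0
    return abst, err

# Hypothetical polled results for four test profiles.
abst, err = rates(["A", None, "B", "B"], ["A", "A", "A", "B"])
```

Note that the error rate is taken over predictions actually made, so abstaining more often can lower the error rate while raising the combined abstentions-plus-errors figure.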

*As for the few other adjustable parameters of Fuzzy ARTMAP, the choice parameter is set at .0001, and fast
learning is used. Complement coding is employed in the Fuzzy ART_a module.




aspect (degrees):          -5 to 5 | 40 to 50 | 85 to 95
1a.-3a.:                   28.1, 43.5, 35.3, 54.7, 19.8
4a. av. # nodes:           39 | 106 | 163.5

aspect (degrees):          -.5 to .5 | 4 to 5 | 44.5 to 45.5 | 49 to 50 | 89.5 to 90.5 | 94 to 95
1b. % abst./all:           22.5
2b. %(abst. + err.)/all:   [illegible], 28.8
3b. % err./pred.:          3.04, 8.07, 15.7
4b. av. # nodes:           16 | 20 | 26.5 | 27 | 33.5 | 37.5

Table 1: Training and testing at various aspects. 1a,b: Percentage of abstentions (out of all
tests). 2a,b: percentage of abstentions-plus-errors (out of all tests). 3a,b: percentage of errors
(out of predictions made). 4a,b: average number of recognition nodes (averaged over both
voters).  Training  and  testing  was  on  views  of  the  nine  targets  shown  in  Fig.  2.  Training  views 
were  uniformly  spaced  a  tenth  of  a  degree  apart.  Testing  views  were  randomly  chosen  within  the 
specified  ranges  of  aspects. 


# of targets:              3 (= 1 wp x 3 wl) | 6 (= 2 wp x 3 wl)
aspect (degrees):          -5 to 5 | 40 to 50 | 85 to 95 | [illegible] | 40 to 50 | 85 to 95
1. % abst./all:            8.68
2. %(abst. + err.)/all:    12.0, [illegible]
3. % err./pred.:           [illegible], 3.41, 3.61, 14.4
4. av. # nodes:            8 | 13.5 | 10 | 21 | 64 | 66.5

Table  2:  Fewer  targets.  Same  as  first  four  rows  of  Table  1,  but  using  subsets  of  the  targets  in 
Fig.  2.  Results  in  the  first  three  columns  are  from  networks  trained  and  tested  on  three  targets 
with  one  wing  position  and  three  wing  lengths  (the  first  column  of  Fig.  2);  results  in  the  last  three 
columns  are  from  networks  trained  and  tested  on  six  targets  with  two  wing  positions  and  three 
wing lengths (the first two columns of Fig. 2).

4  Variation  of  aspect  and  number  of  targets 

Results  for  the  targets  of  Fig.  2  are  given  in  Table  1.  Only  the  polled  results  of  the  two  voting 
networks  are  given;  disagreement  was  counted  as  an  abstention.  The  networks  were  trained  on 
views  of  each  target  evenly  spaced  a  tenth  of  a  degree  apart.  The  views  with  which  the  networks 
were  tested  were  randomly  chosen.  The  first  four  rows  of  Table  1  show  results  from  networks 
trained  and  tested  on  aspects  differing  by  up  to  ten  degrees.  The  spread  of  ten  degrees  was  chosen 
because  this  is  more  or  less  the  accuracy  with  which  the  aspect  of  an  aircraft  can  be  estimated  at 
a  distance.  The  last  four  rows  of  Table  1  show  results  for  one-degree  regions,  and  illustrate  that 
increasing  the  angle  over  which  the  networks  are  trained  does  not  in  general  degrade  performance 
(provided, of course, that the number of training views per degree is maintained). Note also that the
number of recognition nodes goes up more slowly than the number of training views, indicating
that Fuzzy ARTMAP is economically “reusing” them.

An important question is the scaling of the error rates and network size with the number of
targets on which the networks are trained. Comparing the first four lines of Table 1, for networks
trained on all of the targets of Fig. 2, with the results presented in Table 2 for networks trained
on subsets of those targets, we see that the error rates may be either larger or smaller as the
number  of  targets  increases;  the  most  important  factor  in  determining  the  error  rate  seems  to  be 
the  particular  viewing  aspect.  The  number  of  recognition  nodes  increases  with  increasing  number 
of  targets,  and  apparently  at  a  greater-than-proportional  rate. 

Tests  were  also  done  on  different  sets  of  targets,  groups  of  sixteen  (four  wing  positions,  four 
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#  of  targets 

                           16 | 25 | 16 | 25
aspect (degrees):          -.5 to .5 | [illegible]
1. % abst./all:            [illegible], 26.1
2. %(abst. + err.)/all:    [illegible], 35.9
3. % err./pred.:           7.56, 13.3, 6.64
4. av. # nodes:            41.5 | 88.5 | 147 | 322.5

Table 3: More targets. Same as first four rows of Table 1, but trained and tested on larger sets of
more-similar  targets. 


wing  lengths)  and  twenty-five  (five  wing  positions,  five  wing  lengths).  The  maximum  and  minimum 
values  of  the  wing  lengths  and  wing  positions  were  the  same  as  for  the  nine  targets  in  Fig.  2,  so 
the  targets  in  these  larger  sets  were  not  only  more  numerous  but  geometrically  more  similar  to 
one  another.  Results  using  these  target  sets,  with  training  views  spaced  a  tenth  of  a  degree  apart, 
are  given  in  Table  3. 

5  Separability  and  variation  of  training  view  spacing 

Not  surprisingly,  increasing  the  number  of  training  views  per  degree  decreases  the  error  rates  for 
a  given  set  of  targets.  One  would  like  to  know:  how  fast  does  it  decrease?  And  does  it  go  to  zero, 
or  approach  some  greater-than-zero  limit?  For  example,  the  latter  would  be  the  case  if,  in  the 
space of possible range profiles (in this case, a 30-dimensional space), the patterns corresponding
to different targets formed partly-overlapping clusters. In the regions of overlap it would be
impossible to distinguish targets by their range profiles. In principle, this problem should occur;
there  is  no  a  priori  reason  why  two  distinct  targets,  at  certain  respective  viewing  angles  and  for 
certain  bin  sizes  and  wavelengths  of  illumination,  could  not  have  arbitrarily  similar  range  profiles. 
In  practice,  for  the  set  of  9  targets  in  Fig.  2,  increasing  the  number  of  training-views-per-degree 
rapidly  decreases  the  error  rate  (out  of  predictions  made)  to  zero,  and  the  abstention  rate  to 
less than half a percent. (See Fig. 3.) Such a “brute force” approach to higher accuracy is not
a  practical  one.  (Approaches  for  increasing  accuracy  with  a  fixed  number  of  training  views  are 
briefly  mentioned  in  Sec.  8).  However,  it  does  suggest  that  at  least  there  may  be  no  obstacle  in 
principle  to  reaching  high  accuracies. 

6  Interpolation  of  input  data 

One  approach  to  keeping  down  the  number  of  training  views  is  to  create  “extra”  training  data  by 
interpolation. That is, we take two profiles of the same target at nearby aspects and, assuming that
profiles taken from in-between aspects will vary smoothly from one of the measured* profiles to the
other as the aspect is changed, calculate profiles for the in-between aspects. The measured profiles
and  the  interpolates  are  then  used  together  as  the  training  set,  as  if  a  larger  set  of  measurements 
with  more  closely-spaced  views  had  been  taken. 
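
A minimal sketch of such an interpolation step is given below. The paper does not state the interpolation formula, so linear blending between the two measured profiles is an assumption here, as is the function name `interpolate_profiles`.

```python
import numpy as np

def interpolate_profiles(p_a, p_b, n_interp):
    """Return n_interp profiles linearly blended between p_a and p_b (endpoints excluded)."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    fractions = np.arange(1, n_interp + 1) / (n_interp + 1)
    return [(1.0 - f) * p_a + f * p_b for f in fractions]

# Hypothetical two-bin profiles measured at adjacent aspects.
extra = interpolate_profiles([1.0, 0.0], [0.0, 1.0], 3)
```

A linear blend of two profiles that each sum to one also sums to one, so the interpolates need no renormalization under this assumption.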

This procedure is probably better suited to tasks other than the one at hand, given that the essential
characteristic  of  the  problem  is  the  non-smooth  variation  of  the  profiles  as  the  aspect  is  changed. 
Upon  investigation,  the  approach  is  found  to  have  some  utility  when  the  measured  profiles  are 
spaced relatively coarsely (and no more than four or five interpolates are used between pairs of
adjacent  measured  profiles),  but  to  be  less  effective  as  the  separation  of  the  measured  profiles  is 
decreased.  (See  Fig.  4.)  In  particular,  it  does  not  seem  to  be  a  promising  avenue  for  achieving 
high  accuracy. 

*Of course, all of the “data” in the present paper is synthetic; we have in mind here a practical implementation
where the range profiles are obtained by actual measurement.
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Figure  3:  Closer  training  views.  Percentages  of  abstentions  (empty  boxes)  and  abstentions-plus- 
errors  (filled  boxes)  out  of  all  tests,  and  percentage  of  errors  out  of  predictions  made  (circles),  for 
the nine targets of Fig. 2, as a function of numbers of uniformly-spaced training views per degree
(abscissa).  Testing  and  training  views  were  at  aspects  between  -.5  and  .5  degrees. 


Figure  4:  Interpolation.  Same  as  Fig.  3,  but  using  extra  interpolated  training  profiles.  Abscissas 
are  numbers  of  interpolates  between  each  pair  of  adjacent  measured  profiles.  Left:  two  measured 
profiles  at  -.5  and  .5  degrees,  resp.  Right:  eleven  measured  profiles  evenly  spaced  from  -.5  to  .5 
degrees. 
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Figure  5:  Unknown-target  rejection  versus  known-target  error.  From  bottom  to  top  along  each 
curve,  the  data  points  correspond  respectively  to  baseline  vigilance  values  of  0,  .925,  .95,  and  .975. 
All  ordinates  are  percentages  of  unknown  targets  which  are  rejected.  Abscissas  are  percentages  of 
abstentions-plus-errors out of all tests (filled boxes), and percentage of errors out of predictions
made (circles), for the nine targets of Fig. 2. Training and testing views are at aspects between
-.5 and .5 degrees; training views uniformly spaced a tenth of a degree apart; two voters on the
left, four voters on the right. “Known” targets are the six targets in the top two rows of Fig. 2,
“unknown” targets are the three in the last row. Vigilance was the same during training and
testing. (The number of test patterns for unknown target rejection was 500 for each value of
baseline vigilance, half the number used for testing in all other runs in this paper.)

7  Rejection  of  unknown  targets 

In  all  of  the  tests  presented  so  far,  the  set  of  training  targets  has  been  identical  to  the  set  of 
test  targets.  So,  the  task  facing  the  neural  network  has  been  to  assign  a  test  profile  to  a  target 
category,  given  that  the  profile  is  definitely  known  to  correspond  to  one  of  a  fixed  set  of  categories. 
An equally important task is rejection of profiles corresponding to unknown targets, i.e., targets
for which the network has not been trained. In the basic form in which we have been operating
it up to this point, Fuzzy ARTMAP has had only one mechanism with which to reject unknown
categories: it can abstain because, although each voter has made a choice of classification, their
choices have not been unanimous. A simple way of providing another mechanism for rejection is to
raise the baseline vigilance.* As mentioned above, nonzero baseline vigilance seems to somewhat
worsen  performance  on  the  known-target  recognition  task;  however,  it  improves  performance  on 
the  unknown-target  rejection  task,  a  not-unexpected  tradeoff.  In  Fig.  5  the  rate  of  abstentions  on 
unknown  targets  (ideally  100%)  is  plotted  against  both  the  error  rates  (out  of  predictions  made) 
and  the  combined  error-plus-abstention  rate  on  known  targets  (both  ideally  0%). 

(Interestingly,  the  rate  of  abstention  by  individual  voters  is  less  than  1%  for  all  but  the  highest 
level  of  baseline  vigilance  (.975)  in  Fig.  5,  and  the  increase  in  the  ability  to  reject  unknown 
targets seems due to the increased number of F2 nodes which nonzero baseline vigilance causes
to be created. At baseline vigilance=.975, abstention by single voters has become significant,
accounting  for  about  three-fourths  of  the  total  abstentions  on  unknown  targets.) 
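
The single-voter rejection mechanism can be sketched as follows: at test time no learning takes place, and a voter abstains when no allocated node passes the vigilance test. This uses the standard Fuzzy ART match rule; the function name `classify_or_reject`, the two weight vectors, and the labels are hypothetical values chosen for illustration.

```python
import numpy as np

def classify_or_reject(a, weights, labels, vigilance, alpha=1e-4):
    """Return the label of the best node passing the vigilance test, else None."""
    a = np.asarray(a, dtype=float)
    i = np.concatenate([a, 1.0 - a])                     # complement coding
    scores = [np.minimum(i, w).sum() / (alpha + w.sum()) for w in weights]
    for j in np.argsort(scores)[::-1]:                   # highest choice value first
        if np.minimum(i, weights[j]).sum() / i.sum() >= vigilance:
            return labels[j]
    return None                                          # abstain: nothing matches

# Hypothetical trained nodes for two known targets.
weights = [np.array([0.1, 0.2, 0.88, 0.8]), np.array([0.9, 0.9, 0.1, 0.1])]
labels = ["A", "B"]
known = classify_or_reject([0.1, 0.2], weights, labels, vigilance=0.95)
unknown = classify_or_reject([0.5, 0.5], weights, labels, vigilance=0.95)
forced = classify_or_reject([0.5, 0.5], weights, labels, vigilance=0.0)
```

With zero vigilance every input is assigned to some node, which is why, in the basic configuration, rejection can come only from disagreement among voters.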

*During the testing phase the learning rate is set to zero and no new nodes are added, so the network will
abstain on test inputs which do not satisfy the vigilance criterion for any of the allocated nodes.
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8  Outlook 


The  initial  investigations  which  have  been  reported  above  show  the  promise  of  Fuzzy  ARTMAP 
for  the  range  profile  recognition  task.  Several  modifications  of  the  basic  setup  employed  here 
suggest  themselves  as  ways  of  achieving  the  higher  levels  of  performance  required  for  a  production 
system.  Greater  accuracy  in  the  recognition  of  familiar  targets  could  be  accomplished  by  basing 
classification on a succession of profiles[2], rather than a single “snapshot.” (An extension of Fuzzy
ARTMAP  suited  to  such  an  approach  already  exists,  ART-EMAP[6]).  As  for  increasing  the  rate  of 
unknown-target  rejection,  greater  effectiveness  might  be  achieved  by  utilizing  the  outputs  of  many 
“specialist”  networks,  all  of  which  are  trained  on  the  same  training  set,  but  each  of  which  is  trained 
to  give  only  a  binary  “present”  or  “absent”  response  to  one  specific  target.  These  approaches  are 
under  study. 
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ABSTRACT 

The  Clinical  Matrix  [1]  can  be  used  to  diagnose  many  psychiatric 
disorders.  Data  collected  by  clinical  physicians  can  be 
summarized in such a way that neural networks can be implemented
to produce an educated judgment as to the proper Clinical
Psychiatric Disorder (CPD). By using neural networks to help
make diagnoses based on the numerous symptoms that are inherent
to CPDs, there is the possibility of quicker and conceivably
more accurate identification of the psychiatric disorder.
Neural networks based on the Clinical Matrix, used as a tool
by the medical community, could enhance the
clinical diagnostic procedure.

KEY  WORDS 

Clinical  Matrix,  Neural  Networks,  diagnosis,  learning,  recalling, 
robustness,  back-propagation,  hidden  layers,  transfer  function. 

I.  INTRODUCTION 

By  using  the  Clinical  Matrix  to  help  correlate  the  large  amount 
of  collected  data,  it  is  feasible  to  use  much  more  data  within  a 
short  period  to  analyze  a  given  problem.  The  Clinical  Matrix  can 
be  applied  to  neural  networks  in  order  to  combine  the  experience 
of  several  physicians.  This  report  introduces  the  use  of  the 
Clinical  Matrix  as  it  applies  to  the  diagnosis  of  symptoms 
relating  to  various  psychiatric  disorders.  Back-Propagation  was 
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chosen  for  the  Clinical  Matrix  due  to  its  adaptability,  learning, 
and  recall  characteristics.  Learning  is  the  ability  of  the 
network  to  modify  the  connecting  weights  in  response  to  stimulus 
presented  at  the  input.  This  could  also  be  considered  to  be  the 
teaching or training of the network. In recall, the network
processes the stimulus presented at its input and
calculates a response at the output.

II.  DESCRIPTION  OF  BACKPROPAGATION  NETWORK
Neural  networks  are  well  suited  for  the  Clinical  Matrix 
application.  Neural  networks  have  the  ability  to  learn,  and  they 
can  be  used  to  interpret  data  and  calculate  a  proper  response  to 
give  a  desired  output.  Also,  the  Clinical  Matrix  allows  the 
network  to  provide  the  desired  output  even  when  the  input  is 
imprecise (i.e., different physicians place different weights on
various symptoms). Since neural networks work better with
greater  amounts  of  data,  the  Clinical  Matrix  lends  itself  to 
using  large  amounts  of  clinical  data  gathered  from  many 
physicians.  Networks  rely  on  hidden  layers  and  output  layers 
with  a  specified  learning  rule  and  a  transfer  function  that 
adjusts the output data to what is required (Figure 1). The way
in  which  a  network  is  trained  depends  upon  the  type  of  transfer 
function  and  learning  rule  used  with  the  network.  The  transfer 
function  is  used  in  conjunction  with  the  PE's  (processing 
elements)  and  determines  the  way  in  which  data  is  propagated  from 
input  to  output.  The  various  learning  rules  determine  the  way  in 
which  data  is  summed  and  how  error  is  handled  to  adjust  the 
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connttctlon  weights.  The  weights  are  recalculated  to  bring  the 
actual  output  to  within  an  acceptable  accuracy  of  the  desired 
output. Processing elements are used to connect the input and
output  functions.  A  processing  element  can  be  considered  as  a 
form  of  data  input  and  data  output.  Each  PE  has  associated  with 
it  a  path  connecting  itself  to  a  previous  layer  and  to  the  next 
layer in the network (Figure 1). The transfer function we used
in  the  Clinical  Matrix  is  the  Sigmoidal  function.  The  Sigmoidal 
transfer  function  is  a  continuous  mapping  of  input  data  into  a 
value  between  zero  and  one:  i.e. 

$T = (1 + e^{-a})^{-1}$    [1]

where T is the result of the transfer function. The function a
is given by

$a = \sum_{j=1}^{d} W_{ij} X_j$    [2]

where d is the total number of diseases under consideration;
$W_{ij}$ denotes the elements of the clinical matrix in Table 1;
i is the row index label of symptoms (1 to 33);
j is the column index label of psychotic disorders (1 to 7);
and $X_j$ is initially a set of random numbers between 0
and 1.
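
Equations [1] and [2] can be illustrated with a few lines of code. This is a sketch only: the function name `transfer` is illustrative, and the row of matrix weights and the values of X below are hypothetical numbers, not entries taken from Table 1.

```python
import numpy as np

def transfer(w_row, x):
    """Sigmoid T = 1 / (1 + exp(-a)) of the weighted sum a = sum_j w_row[j] * x[j]."""
    a = np.dot(w_row, x)                 # Eq. [2]
    return 1.0 / (1.0 + np.exp(-a))      # Eq. [1]

# Hypothetical weights for one symptom across d = 7 disorders.
w_row = np.array([0.4, 0.0, 0.0, 0.0, 0.6, 0.2, 0.0])
x = np.full(7, 0.5)                      # hypothetical values of X between 0 and 1
t = transfer(w_row, x)
```

Since a = 0 gives T = 0.5 and T increases monotonically with a, the output always lies strictly between zero and one.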

Figure  1  is  a  graphical  presentation  showing  the  input  PE's 
(signs  and  symptoms) ,  the  hidden  layer  PE's,  and  the  output  PE's 
(psychiatric  disorders)  along  with  the  associated  path  which 
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connects  the  three  layer's.  The  figure  demonstrates  a  33xMx7 
matrix, where M can range from 6 to 15. For the purpose of this
study,  6  to  15  processing  elements  in  the  hidden  layer  produced 
equivalent  results  within  the  same  amount  of  time. 
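
For orientation, a 33xMx7 network of this kind can be sketched as below, with one backpropagation update rule for a squared-error criterion. The paper does not list its learning rate, weight initialization, or use of bias terms, so the values here (M = 10, learning rate 0.5, uniform initial weights, no biases) and the random input pattern are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 33, 10, 7            # M = 10 is one choice from 6 to 15
W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    """Return hidden-layer and output-layer activations for input x."""
    h = sigmoid(W1 @ x)
    return h, sigmoid(W2 @ h)

def train_step(x, target, lr=0.5):
    """One backpropagation update; returns the squared error before the update."""
    global W1, W2
    h, y = forward(x)
    delta_out = (y - target) * y * (1.0 - y)         # output-layer error signal
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)   # error propagated to hidden layer
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return 0.5 * np.sum((y - target) ** 2)

x = rng.uniform(0.0, 1.0, n_in)              # hypothetical symptom severities
t = np.zeros(n_out)
t[0] = 1.0                                   # desired output: first disorder
errors = [train_step(x, t) for _ in range(200)]
```

Training would normally cycle over all patterns of the training set rather than a single one; the single pattern here only shows the error decreasing under repeated updates.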

III.  PSYCHIATRIC  DISORDERS  USED  FOR  STUDY 
The  symptoms  for  several  psychiatric  disorders  concerning 
schizophrenia  can  be  included  in  a  clinical  matrix.  These 
symptoms  are  shown  in  Table  1.  This  table  was  derived  from 
Noyes'  Modern  Clinical  Psychiatry.  [6]  The  33  psychiatric 
symptoms  were  given  according  to  the  severity  of  the  symptom  for 
each  psychotic  disorder.  The  input  data  presented  to  the 
network  is  shown  in  Table  2.  In  formulating  his  concept  of 
Psychotic  Disorders,  Kraepelin  classified  his  cases  into 
different  varieties,  depending  on  the  predominant  symptomatology. 
Although  classification  according  to  reaction  type  continues  to 
be  made,  it  must  be  recognized  that  numerous  patients  show  at  one 
time  or  another  psychopathology  characteristic  of  the  individual 
groups.  From  the  Clinical  Matrix,  it  is  obvious  that  while  some 
psychological  disorders  exhibit  the  same  symptoms,  they  vary  in 
the  degrees  of  severity.  What  makes  the  use  of  neural  networks 
so  beneficial  in  this  study  is  that  the  data  can  vary  slightly 
between  physicians,  for  instance,  and  due  to  the  attributes  of 
the  Neural  Network  the  same  diagnosis  will  be  obtained.  Shown  in 
Table  3  is  the  test  data  that  was  input  into  the  network  for 
training  and  the  response  that  was  diagnosed.  Although  the  data 
included  some  variation  of  symptoms  that  are  common  to  other 
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disorders,  and  differing  weights  the  network  was  able  to 
determine the proper response for all psychotic disorders. This
test data demonstrates the robustness of the network's capability
to produce a reliable output within an allowable tolerance, in
this case 10 percent.

IV.  CONCLUSION 

Applying the Clinical Matrix to interpret the data obtained
during normal clinical interviews could enhance diagnosis, which
depends on the subjective interpretation of data. The knowledge of a larger
number  of  physicians  should  be  included  in  any  diagnosis. 

Various  psychiatric  disorder  interview  techniques,  such  as  the 
Psychiatric  Epidemiology  Research  Instrument  (PERI)  and  the 
Structured  Clinical  Interview  for  DSM-III-R:  Psychotic  Disorders 
(SCID-PD)  [5],  would  likely  benefit  from  using  the  Clinical 
Matrix technique. The PERI and the SCID-PD were used in a survey of
homeless men who were screened and diagnosed for psychotic
disorders. The interviewers who screened the homeless men were mainly
graduate students in psychology and social work. The large
amounts of data produced in this study are best handled by the
clinical matrix. The advantages of this methodology include the
speed with which data are interpreted, the low cost of personal computers, and
the user friendliness of a properly designed computer
program.
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PSYCMATWC  SYMPTOMS  CONCERMNQ  PSYCHOTIC  MSOIIDERS 


Columns: CATATONIC STUPOR SCHIZOPHRENIA | CATATONIC EXCITEMENT SCHIZOPHRENIA | SIMPLE SCHIZOPHRENIA | HEBEPHRENIC SCHIZOPHRENIA | PARANOID SCHIZOPHRENIA | INVOLUTIONAL MELANCHOLIA | MANIC DEPRESSIVE

GRIEF                         0.3
SADNESS                       0.1  0.2
ANXIETY                       0.5  0.3
SHAME                         0.4  0.3
GUILT                         0.6  0.2
HELPLESSNESS                  0.3  0.2  0.8  0.4
SELF-ESTEEM                   0.3  0.8  0.3  0.3  0.2
INHIBITION OF PERSONALITY     0.4  0.7  0.6  0.5  0.7  0.4
INSOMNIA                      0.3
SUSPICIOUS                    0.7  0.4
DOUBT                         0.4  0.6  0.2
INTEREST                      0.2  0.3  0.3  0.4
EATING HABITS                 0.6  0.1  0.9  0.4
LOSES WEIGHT                  0.6  0.4
RAGE                          0.7  0.7
AFFECTIVE REGRESSION          0.7  0.4  0.7
HEALTH SWINGS                 0.4  0.7
SLEEPS POORLY                 0.7  0.5  0.6
FATIGUE                       0.2
ATTENTION                     0.1  0.3  0.1  0.1
HALLUCINATIONS                0.2
EMOTION                       0.2  0.6
DISORDER                      0.2
MOODY                         0.5
INDIFFERENCE                  0.3  0.2  0.4  0.4
EMOTION SWINGS                0.2  0.2  0.3
RESPONSIBILITY                0.2
SILLINESS                     0.5  0.5
SPEECH                        0.4  0.4
INCOHERENT STUPOR             0.6  0.6
AUTISTIC LIFE                 0.4  0.4  0.7
PERSONNEL HYGIENE SWINGS      0.1  0.1  0.7
REACTION TO PAINFUL STIMULI   0.2

TABLE  1 


INPUT PROCESSING ELEMENTS (PE's): SIGNS AND SYMPTOMS AS PER TABLE 1 (33 NEURONS)


FIGURE  1 
TRAINING DATA

TRAINING SET INPUT FILE    DESIRED OUTPUT FILE    PSYCHOTIC DISORDER

0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.4  0.2  0.6  0.0  0.0  0.0  0.0  0.0  0.0  0.1
0.0  0.0  0.0  0.0  0.3  0.2  0.0  0.0  0.0  0.6  Catatonic Stupor
0.4  0.1  0.2  1.0  0.0  0.0  0.0  0.0  0.0  0.0  Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.0  0.3  0.4  0.0  0.0
0.0  0.0  0.1  0.6  0.0  0.0  0.0  0.7  0.0  0.0
0.2  0.0  0.0  0.0  0.0  0.2  0.2  0.5  0.4  0.0  Catatonic Stupor
0.4  0.1  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.3  0.0  0.7  0.0  0.0
0.0  0.3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.2  0.2  0.5  0.2  0.0  0.0  0.0  0.0  0.0  Simple
0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.6  0.0  0.0
0.0  0.0  0.0  0.0  0.7  0.7  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.3  0.0  0.5  0.4  0.6  Hebephrenic
0.7  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.2  0.3  0.5  0.0  0.7
0.6  0.3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.3
0.0  0.6  0.0  0.0  0.4  0.0  0.0  0.0  0.0  0.0  Paranoid
0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  Schizophrenia

0.3  0.1  0.5  0.4  0.6  0.8  0.3  0.7  0.3  0.4
0.2  0.4  0.4  0.4  0.0  0.4  0.4  0.5  0.0  0.1
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  Involutional
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  Melancholia

0.0  0.2  0.3  0.3  0.2  0.4  0.2  0.4  0.0  0.0
0.0  0.0  0.0  0.0  0.7  0.7  0.7  0.6  0.2  0.1
0.0  0.0  0.0  0.0  0.4  0.0  0.0  0.0  0.0  0.0  Manic
0.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  Depressive


TABLE 2

TEST VERIFICATION DATA

TEST INPUT DATA    ACTUAL OUTPUT FILE    PSYCHOTIC DISORDER

0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.3  0.1  0.4  0.2  0.0  0.0  0.0  0.0  0.0  0.3
0.0  0.0  0.0  0.0  0.3  0.2  0.0  0.0  0.2  0.6
0.4  0.2  0.2
.S3  .04  .02  .02  .09  .00  .01
Catatonic Stupor Schizophrenia

0.0  0.0  0.2  0.0  0.0  0.0  0.2  0.5  0.0  0.0
0.0  0.3  0.3  0.6  0.0  0.0  0.0  0.6  0.0  0.0
0.2  0.0  0.0  0.0  0.0  0.2  0.2  0.5  0.4  0.0
0.4  0.1  0.0
.01  .M  .02  .02  .01  .03  .00
Catatonic Stupor Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.3  0.0  0.5  0.0  0.0
0.0  0.3  0.0  0.0  0.0  0.0  0.7  0.0  0.0  0.0
0.0  0.2  0.2  0.5  0.2  0.0  0.1  0.0  0.0  0.0
0.0  0.0  0.0
.02  .01  M  .00  .02  .02  .06
Simple Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.0  0.7  0.6  0.0  0.0
0.0  0.0  0.9  0.0  0.6  0.8  0.0  0.0  0.0  0.0
0.0  0.0  0.2  0.0  0.0  0.3  0.0  0.5  0.5  0.6
0.7  0.0  0.0
.02  .02  .00  .87  .00  .01  .02
Hebephrenic Schizophrenia

0.0  0.0  0.0  0.0  0.0  0.3  0.3  0.6  0.0  0.4
0.6  0.3  0.0  0.0  0.4  0.0  0.0  0.0  0.0  0.3
0.0  0.6  0.0  0.0  0.4  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0
.02  .01  .03  .00  .SO  .02  .03
Paranoid Schizophrenia

0.3  0.1  0.5  0.4  0.6  0.8  0.3  0.7  0.3  0.4
0.2  0.4  0.4  0.4  0.0  0.4  0.4  0.5  0.0  0.1
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0
.00  .01  .04  .01  .02  .93  .02
Involutional Melancholia

0.0  0.0  0.5  0.3  0.2  0.4  0.2  0.4  0.0  0.0
0.0  0.0  0.0  0.0  0.4  0.4  0.7  0.6  0.2  0.1
0.0  0.0  0.0  0.0  0.4  0.0  0.0  0.3  0.0  0.0
0.0  0.6  0.0
.00  .02  .01  .01  .01  .03  .91
Manic Depressive

TABLE 3
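
The seven-valued output rows of the test table can be read as per-class scores, with the largest score taken as the network's diagnosis. The following is a minimal sketch of that reading, not the authors' program. The output order in `DISORDERS` is an assumption inferred from the position of the 1.0 in each desired-output row of the training table, and the function name `decode` is hypothetical. Only the three output rows that survive extraction cleanly are used.

```python
# Assumed output order, inferred from where the 1.0 sits in each
# desired-output row of the training table (not stated in the text).
DISORDERS = [
    "Catatonic Stupor Schizophrenia (first pattern)",
    "Catatonic Stupor Schizophrenia (second pattern)",
    "Simple Schizophrenia",
    "Hebephrenic Schizophrenia",
    "Paranoid Schizophrenia",
    "Involutional Melancholia",
    "Manic Depressive",
]


def decode(outputs):
    """Return (label, score) for the largest of the seven output values."""
    if len(outputs) != len(DISORDERS):
        raise ValueError("expected one output value per disorder")
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return DISORDERS[best], outputs[best]


# Output rows copied from the test table (legible rows only).
rows = [
    [0.02, 0.02, 0.00, 0.87, 0.00, 0.01, 0.02],
    [0.00, 0.01, 0.04, 0.01, 0.02, 0.93, 0.02],
    [0.00, 0.02, 0.01, 0.01, 0.01, 0.03, 0.91],
]
for row in rows:
    print(decode(row))
```

Under the assumed ordering, the three rows decode to the same labels the table prints beside them.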
Abstract

Principal components analysis has been a valuable tool for statistics, signal processing, and AI applications.
The process involves finding the eigenvectors of the lumped covariance matrix of a statistical sample.
We discuss here the computation of the principal components with an iterative method based on
neural nets. The method consists of training a feed-forward neural net structure. The goal that the training
attempts to attain relates to Fisher's measure for linear discriminants, and so the principal components are
attractors for the convergence of the training method.


1.  Introduction 

In pattern recognition methods, objects are usually represented as vectors of attribute values. Each attribute
is usually a feature which is deemed pertinent to the recognition task. The set of such features to be
considered is determined a priori and by trial-and-error methods, since it is not always known exactly which
features are necessary or sufficient. Thus the vector representations of objects may be long, and the
corresponding vector space (in terms of which the object space is expressed) may be of high dimensionality.
It is therefore necessary to determine alternative representations for the object space, that is, in terms of a different
vector space or a different coordinate system of as low a dimensionality as possible. The idea is roughly to find
a subspace of the original vector space of the least dimensionality in which the various projected vectors are
still distinguishable in the proper object classes. This is the idea behind the Karhunen-Loeve transformation,
or principal components analysis. It has been useful in projecting vector spaces of high dimensionality into
other spaces of lower dimensionality while retaining as much as possible of the discriminant information
content of the original space.

The  process  to  compute  the  principal  components  involves  costly  matrix  operations  and  a  number  of 
efforts  have  been  undertaken  to  compute  them  by  alternative  methods.  Such  an  alternative  method  based  on  a 
rather  simple  neural  network  structure  is  discussed  here.  It  is  instructive  to  provide  an  outline  of  the  approach 
here  before  getting  into  details. 

Fisher's method [2,3,5] is based on the optimization of a certain measure which reflects certain restrictions
on the overall statistics of the samples in the projection space (the new vector space which is sought).
The solution to the optimization problem is exactly the eigenvectors of the lumped covariance matrix of the
samples. The idea pursued here is to train a feed-forward neural net with a training objective that is similar to
Fisher's measure. If this objective is attained, then the solution eigenvectors should be reflected in the trained
net structure after the training process has converged.


2.  The  network  based  method 


Let's assume that we have a feed-forward net into which we feed input vectors $X_i$ and from which we receive
corresponding output vectors $Y_i$. The $Y_i$'s are functions of the $X_i$'s and of the weights associated with the net. The
weights in particular are parameters which determine the net's transfer function. As these weights change, the
$Y_i$'s move around; that is, they represent points which move within the output space. Suppose now that the
$X_i$'s are partitioned into object classes, and suppose that we would like to identify the network weights which
will attain the following goal. In a topological sense, we would like outputs which correspond to inputs of the
same class to be close to each other (in terms of Euclidean distance), and outputs of different classes to be as
far apart as possible. Thus, we do not require that outputs approximate any specific target values or any
specific distribution. Outputs are free and can attain any values as long as they satisfy this aggregate clustering
constraint. A similar criterion has been used before for various purposes, as in the Coulomb energy net and
in hybrid nets [1,7,9]. It turns out that this criterion is significant in many ways. Now, let's assume that we
can devise a measure G(W), with the net's weights W as parameters, which reflects the goal described above.
That is, we assume that G(W) is a measure of the "goodness" of the topological distribution of the $Y_i$'s
(with respect to the above goal). Then we can come up with a training algorithm for identifying the weights
W using G(W) as the energy function.


There are a few options regarding an appropriate measure G(W). We can devise a distance measure
$D(Y_i,Y_j)$ (positive, of course) on pairs of output vectors and then use:

\[
G = \sum_{i,j} a_{ij}\, D(Y_i, Y_j),
\qquad
a_{ij} =
\begin{cases}
1 & \text{if } \mathrm{class}(X_i) = \mathrm{class}(X_j) \\
-1 & \text{if } \mathrm{class}(X_i) \neq \mathrm{class}(X_j)
\end{cases}
\tag{1}
\]

If $a_{ij}$ is positive, then the corresponding term $D(Y_i,Y_j)$ contributes positively to G; otherwise it contributes
negatively. So G is minimized if the positive terms become very small (output vectors $Y_i$ of the same class
moving close together), while at the same time the negative terms become large (output vectors $Y_i$ of different
classes moving apart). The obvious way to optimize G is gradient descent; thus the weight parameters
should change according to the delta rule: $\Delta W = -\lambda\, \partial G / \partial W$.
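
The measure in (1) and the delta rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the sum is taken over unordered pairs i < j, squared Euclidean distance stands in for D (the text leaves D open), the gradient is estimated by central differences, and the names `goodness`, `squared_distance` and `delta_rule_step` are hypothetical.

```python
import numpy as np


def squared_distance(u, v):
    """Squared Euclidean distance; one possible choice of D (an assumption)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sum(diff ** 2))


def goodness(Y, labels, D=squared_distance):
    """G = sum over pairs i < j of a_ij * D(Y_i, Y_j), as in Eq. (1).

    a_ij is +1 when the two outputs belong to the same class, -1 otherwise.
    """
    G = 0.0
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            a = 1.0 if labels[i] == labels[j] else -1.0
            G += a * D(Y[i], Y[j])
    return G


def delta_rule_step(W, G_of_W, lam=0.01, eps=1e-6):
    """One update W <- W - lam * dG/dW with a central-difference gradient."""
    W = np.asarray(W, dtype=float)
    grad = np.zeros_like(W)
    for k in range(W.size):
        step = np.zeros_like(W)
        step.flat[k] = eps
        grad.flat[k] = (G_of_W(W + step) - G_of_W(W - step)) / (2 * eps)
    return W - lam * grad


# Two tight, well-separated clusters score lower (better) than mixed ones.
tight = goodness([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1])
mixed = goodness([[0.0], [5.0], [0.1], [5.1]], [0, 0, 1, 1])
print(tight, mixed)
```

With squared distance the measure can be driven down without bound simply by scaling the outputs, so a practical run would also need some normalization of W; the sketch leaves that out.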

Let's now turn to a short review of Fisher's linear discriminant method. A statistical sample of classified
vectors of some vector space is given. The method seeks to identify a direction (or a line) within the vector
space. The goal is that the projections of the given sample set on this direction cluster into class sets that are
as well discriminated as possible. Depending on the choice of this direction, the projections of the various class subsets of
the sample may overlap or may fall in well separated regions. Often, there may not exist a direction such that
the projections of the various classes are well separated, and more or less partial overlaps are inevitable. The
goal of Fisher's method is to find the best discriminant direction. The problem is formulated as an optimization
problem where the measure to be optimized is:

\[
\frac{\text{Sum of squared distances of the mean projections of all classes from the projection of the overall mean}}
     {\text{Variance of all sample projections}}
\tag{2}
\]

The directions which optimize this measure, in a certain order, are the eigenvectors which correspond to the
eigenvalues of the lumped covariance matrix of the sample. What is interesting is that this measure is compatible
with the one described for the transforming neural net above. If in that net we set the neuron activation
functions to be linear, then the output of each neuron is the projection of the neuron's input vector in the direction
of the neuron's weight vector. If we apply the training method described above on a single linear neuron,
then Fisher's solution should be an attractor of the convergence of the (iterative) training process.
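
The measure in (2) is easy to evaluate for a candidate direction. The sketch below is only an illustration: the function name `fisher_measure` and the toy data are assumptions, class means are unweighted as the wording of (2) suggests, and the denominator uses the population variance (NumPy's default).

```python
import numpy as np


def fisher_measure(X, labels, w):
    """Ratio (2): spread of the class-mean projections over total variance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)          # only the direction matters
    p = X @ w                          # projection of every sample on w
    overall = p.mean()                 # projection of the overall mean
    between = sum((p[labels == c].mean() - overall) ** 2
                  for c in np.unique(labels))
    return between / p.var()           # population variance of projections


# Toy sample: the two classes differ along the first axis only.
X = [[0, 0], [0, 1], [4, 0], [4, 1]]
labels = [0, 0, 1, 1]
print(fisher_measure(X, labels, (1, 0)))   # discriminating direction
print(fisher_measure(X, labels, (0, 1)))   # non-discriminating direction
```

The first direction separates the classes and gives a large ratio; the second projects both class means onto the same point and gives zero.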



Figure  1.  Experimentation  on  a  simple  X-OR  set. 

We used the X-OR problem as one of the standard benchmarks. Inputs 00 and 11 are supposed to produce
a target of 0, and inputs 01 and 10 are supposed to produce a target of 1. The activation function of the
neuron was set to $f = W \cdot X$. The process converged to one of the vectors oriented at (1,1) and (1,-1), as
shown in Figure 1. Both of them are equally good. The reason is simple: the best the net could do was to
map the input vectors of a single class to a single output vector and keep the outputs for the vectors of the
other class apart from that one. The end result depended on the initial setting of the neuron's weight vector.
Then we used 2 linear neurons in a single layer. Since the two neurons operate in parallel without exchanging
any information, each one recreates the earlier behavior of a single neuron independently of the other. The
resulting weight vectors would end up in the directions (1,1) and (1,-1) independently of each other, and often
both weight vectors ended up in the same direction (depending on initial values). So we introduced a lateral
virtual interaction from one neuron to the second. The interaction was one way, from the first to the second.
Thus, the first neuron operated freely in trying to optimize G(W), whereas the second operated under a restriction. The
second neuron had to optimize the G(W) function augmented with the correlation of the two weight vectors; so
we used $G + W_1 \cdot W_2$, thus attempting to reduce also the correlation between the weight vectors. As a result we
consistently got the above principal components independently of initial weight values.
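
The claim that each of the two directions maps one class onto a single output can be checked directly. This short sketch only evaluates the linear neuron $f = W \cdot X$ at the two reported directions; it does not reproduce the training run, and the helper names `project` and `correlation_penalty` are hypothetical.

```python
import numpy as np

# The four X-OR inputs and their targets (00 -> 0, 01 -> 1, 10 -> 1, 11 -> 0).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 0])


def project(w):
    """Output of a single linear neuron, f = W . X, for every input."""
    return X @ np.asarray(w, dtype=float)


def correlation_penalty(w1, w2):
    """The term added to G for the second neuron: the dot product W1 . W2."""
    return float(np.dot(w1, w2))


for w in [(1, 1), (1, -1)]:
    y = project(w)
    print(w, "class 0 ->", y[targets == 0], "class 1 ->", y[targets == 1])

print(correlation_penalty((1, 1), (1, -1)))
```

Along (1,1) the class-1 inputs 01 and 10 both map to 1, while along (1,-1) the class-0 inputs 00 and 11 both map to 0. The two directions are orthogonal, so the added correlation term vanishes when the neurons settle on different directions.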
To conclude, we should provide some remarks regarding the choice of the distance measure.
There are a few possibilities for its choice, such as:


D(Yi.Yj)== - ^ ^ 

^  \Yi-Yj\^ 


Y:\Yi-Yj\^ 

g=4 - 

XiTi-r^i" 


G=pyi-r,|2 

•/ 


"  IT.-Fyl 


(3) 

(4) 

(5) 

(6) 


However, the best choice of all of the above was to use $D(Y_i,Y_j) = \frac{1}{|Y_i - Y_j|^2}$. We found most of the rest to

present some technical difficulties, and it turned out that they were not actually capturing or representing the
goal which the output distribution is supposed to attain. For example, the choice of $|Y_i - Y_j|^2$ will work well
until same-class output vectors come sufficiently close. Then their contribution becomes very small relative
to the contribution of the remaining terms for dissimilar pairs. So the process tends to sacrifice the "shrinking" of
each class in favor of moving classes apart from each other. So if a change in weights could bring classes
further apart at the cost of somewhat spreading a certain class, then it would. The choice of (4) was best since
it seems to balance these tradeoffs better. We also used the latter form with a limiter function (sigmoid)
applied to the pairwise distance and obtained faster convergence rates, but the ultimate results appeared rather
insensitive.


3.  Conclusion 

We intended here to point out the apparent relation between principal components analysis and a
neural network training method. The significance of this relation lies in the potential to compute the principal
components by means of a connectionist method. There are various alternatives associated with this approach,
and we provided our insight with respect to certain choices as this insight emerges from our preliminary
experimentation.
Introduction

The  human  pattern  recognition  system  has  the  property  that  it  is  able  to  identify  salient 
features  of  arbitrary  patterns  in  a  parallel  manner  and  to  do  so  in  the  presence  of 
masking  noise.  The  question  of  which  features  are  regarded  as  salient  may  depend  on 
the  definition  of  salience;  e.g.,  features  having  greater  contrast,  greater  overall  area  or  a 
greater  frequency  of  similar  examples  than  the  less  salient  features.  To  match  such 
properties  of  human  pattern  recognition,  we  developed  an  algorithm  based  on  the 
pattern's  autocorrelation  function  (ACF). 

The second order ACF is represented by a 2D array defining the self-similarity of the
pattern at all (x,y) displacements; each entry in the array reflects the similarity between
the pattern and a translation of it. The third order ACF is represented by a 4D array,
each of whose entries reflects the similarity among the pattern and two translations of
it. Thus, successive orders of the ACF form a hyperspace of contingent self-similarities.
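
The two definitions can be written out by brute force. This is a minimal sketch under assumptions the passage does not state explicitly: similarity is taken as the sum of pixelwise products, translations wrap around the array edges, and the names `shifted`, `acf2` and `acf3_entry` are hypothetical.

```python
import numpy as np


def shifted(img, s):
    """Translate img so that out[y, x] = img[y + s[0], x + s[1]] (wrapping)."""
    return np.roll(img, shift=(-s[0], -s[1]), axis=(0, 1))


def acf2(img):
    """Second order ACF: a 2D array with one entry per displacement."""
    h, w = img.shape
    return np.array([[np.sum(img * shifted(img, (dy, dx)))
                      for dx in range(w)] for dy in range(h)])


def acf3_entry(img, s1, s2):
    """One entry of the 4D third order ACF, for the pair of shifts s1, s2."""
    return float(np.sum(img * shifted(img, s1) * shifted(img, s2)))


# A 4 x 4 test pattern containing a horizontal bar of three pixels.
bar = np.zeros((4, 4))
bar[1, 0:3] = 1.0

print(acf2(bar))
print(acf3_entry(bar, (0, 1), (0, 2)))
```

The second order array has one entry per displacement, while the third order ACF needs one entry per pair of displacements, which is why it is four-dimensional.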

By seeking high valued entries of the Nth order ACF, one can uncover N-point features
that are strongly represented in the overall pattern. However, because of the size of the
array involved (viz., $(m \cdot n)^{N-1}$ entries for an m x n pattern array), direct extraction of
high valued entries is intractable for all but the lowest orders for patterns of useful size.

Our solution to the Nth order problem was to develop a procedure for tracking the
trajectory through the Nth order ACF by means of a sequence of N-1 constrained
maximizations, one for each intermediate ACF. In this way, each order was reduced to a
single point in the projection to the pattern array. This solution has the advantage that
computation grows linearly with N, as opposed to growing with the array size to the
power of N. It has the disadvantage that it may well not find globally maximal
subpatterns. However, this disadvantage is more than offset by the algorithm's ability
to find subpatterns corresponding to perceptually interesting features.
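
One way to read the sequence of N-1 maximizations is as a greedy search: fix the best shift found so far, multiply the pattern by its translated copy, and correlate the product with the pattern again. The sketch below follows that reading and is not the authors' code. It treats the pattern as a torus, excludes the zero shift, earlier shifts and their reflections, applies no adjacency constraint, and breaks ties by the first maximum in scan order. The names `correlate` and `acf_trajectory` are hypothetical.

```python
import numpy as np


def shifted(img, s):
    """Translate img so that out[y, x] = img[y + s[0], x + s[1]] (wrapping)."""
    return np.roll(img, shift=(-s[0], -s[1]), axis=(0, 1))


def correlate(base, img):
    """c[s] = sum over pixels of base * (img shifted by s), computed by FFT."""
    spectrum = np.conj(np.fft.fft2(base)) * np.fft.fft2(img)
    # Rounding removes FFT round-off so that exact ties stay exact ties.
    return np.round(np.fft.ifft2(spectrum).real, 9)


def acf_trajectory(img, order):
    """Greedy list of order - 1 shifts, one maximization per ACF order."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    base = img.copy()
    shifts = []
    for _ in range(order - 1):
        c = correlate(base, img)
        c[0, 0] = -np.inf                      # exclude the zero shift
        for dy, dx in shifts:                  # exclude earlier shifts
            c[dy, dx] = -np.inf
            c[-dy % h, -dx % w] = -np.inf      # and their reflections
        dy, dx = np.unravel_index(np.argmax(c), c.shape)
        shifts.append((int(dy), int(dx)))
        base = base * shifted(img, shifts[-1])  # condition on the new shift
    return shifts


# An 8 x 8 pattern holding one horizontal bar of three pixels.
img = np.zeros((8, 8))
img[2, 1:4] = 1.0
print(acf_trajectory(img, 3))
```

Each step costs one correlation over the m x n array, so the work grows linearly with the order. The origin together with the returned shifts gives the N-point feature, which for the bar above is the bar itself.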


Methods. 

To obtain the most salient feature in the image, the algorithm constructs a 2D projection
of the higher order ACF at non-zero shifts by tracking the trajectory of a maximal
non-zero ACF value through the orders. Of course, this procedure is not unique. The
particular constraints applied to the definition of the maximum found at each order
determine the properties of the algorithm. The shift at each order defines the position of
its representation in the 2D trajectory projected onto the pattern array, as depicted for a
Fig. 1. Depiction of autocorrelation trajectory analysis:

(a). Meshplot of luminance profile of a test image consisting of 4 bars of the same
orientation;

(b). Meshplots of the 2nd, 3rd and 4th order ACF surfaces. The central point for
zero shift in the 2nd order ACF has been set to zero, so the highest point is one pixel
away. The 3rd order ACF at this point is then computed, setting the highest point from
the 2nd order (and its reflection) to zero. The trajectory then progresses to the next
highest point, which forms the basis for the 4th order ACF at the fixed shifts for 2nd
and 3rd order.




specific base image in Fig. 1. To limit the solutions to localized images, only correlated
points adjacent to the solution at the previous order were included (as contrasted with
the unconstrained trajectory of Fig. 1). The optimum correlation at each order could be
adjacent to the positions from any of the previous orders, not just the immediately
preceding order. The trajectory therefore forms a "tree" of adjacent shift positions
defining the most salient compact feature in the image. The algorithm is nevertheless
sensitive to repeated examples of the same images at random placements because the
correlations are integrated over the entire image space in the ACF.

There  is  one  free  parameter,  which  is  the  smallest  number  of  matching  conjunctions  to 
allow  before  terminating.  This  may  be  set  at  a  level  designed  to  terminate  the  trajectory 
before  the  point  where  it  is  likely  to  be  substantially  distorted  by  noise  contamination. 
The  terminating  conjunction  number  therefore  is  dependent  on  both  the  level  of  noise 
in  the  image  and  the  probability  criterion  at  which  noise  contamination  is  to  be 
excluded. 


Results. 

The images consisted of a small number (5-10) of identical target subunits (e.g., an
alphabet letter) and an equal number of distractors (e.g., letters different from each
other and from the target). The algorithm was always able to find the largest and/or
most common pattern subunit when subunits were scattered randomly through the
image without overlap. Allowing overlap between the subunits dramatically impaired
humans' ability to distinguish the subunits, but the algorithm was still able to perform
well up to high levels of masking by overlap when the terminal conjunction parameter
was varied to compensate for the noise introduced by the overlap of the features. The
statistical probabilities controlling the terminal conjunction parameter need to be
determined, however, in order to optimize performance at a particular level of masking
noise.

Examples of the typical performance of the algorithm are shown in Figs. 2-4. Repeated
examples of a letter R were scattered at random in Fig. 2(a). (The pattern of 6 Rs is
repeated over two cycles in each direction because the algorithm treated the pattern as
an edgeless torus. Thus, the pattern should be viewed as through a window with the
dimension of one repetition cycle.) Despite the overlaid letters and their random
placement, the algorithm can extract the target pattern element without error.

In Fig. 3, a random target pattern of 4 Ss is overlaid by a mask of 6 other letters
(B,C,D,E,G & H) placed at random. Human observers can identify the S as the most
dominant pattern element with little trouble. The algorithm extracts the features of the
S but also detects a tail introduced by the configuration of the masking letters. Such
defects differ with each random placement of the targets and masks.

Fig. 4 shows a similar example with 4 Ls in the same set of maskers, but one in which
the masker configuration makes it harder for human observers to extract the dominant
pattern element without prior knowledge. The algorithm extracts the whole letter but
picks up additional elements from the masking.
Fig. 2. ACF trajectory analysis without maskers: (a). Six examples of the letter R were
scattered at random. The pattern is repeated over two cycles in each direction and
should be viewed as through a window with the dimension of one repetition cycle; (b).
The ACF trajectory for (a).


Fig. 3. ACF trajectory analysis with masking patterns: (a). A random target pattern of 4
Ss in the two-cycle repeat format of Fig. 2; (b). The pattern in (a) is overlaid by a mask of
6 other letters (B,C,D,E,G & H) placed at random; (c). The ACF trajectory for (b).


Fig. 4. ACF trajectory analysis with a strong masking configuration: (a). A random
target pattern of 4 Ls in the two-cycle repeat format of Fig. 2; (b). The pattern in (a) is
overlaid by a mask of 6 other letters (B,C,D,E,G & H) placed at random; (c). The ACF
trajectory for (b).
Discussion.


The results illustrate that the ACF trajectory approach is superior to the 2nd order ACF
itself because the trajectory is sensitive to the phase relations in the pattern, whereas the
ACF is phase insensitive.

The ACF trajectory can be constrained in various ways other than that of adjacency to
any previous position in the trajectory. If it is known that the target patterns are
one-dimensional "snakes", for example, it could be constrained to include only points
adjacent to the solution for the immediately preceding ACF order, with a corresponding
increase in search speed. If non-localized patterns were sought, the adjacency
constraint could be weakened or abolished, with a consequent increase in search time.

The  approach  to  pattern  analysis  through  ACF  trajectories  is  a  viable  model  for  human 
pattern  recognition  because  it  could  be  implemented  with  local  parallel  processes  in  a 
neuronal  network.  Because  it  has  global  cooperative  properties  in  the  way  common 
elements  reinforce  each  other  at  a  distance,  the  ACF  trajectory  provides  interesting 
insights  into  the  potential  mechanism  by  which  humans  might  process  such  patterns. 


Conclusion. 

The analysis of autocorrelation trajectories provides an efficient means for identification
of salient features in arbitrary patterns and an intriguing model for parallel pattern
analysis by the human brain.


Supported by NIMH grant MH49044.
Abstract

This is the next step of work reported in [Lendaris, Zwick & Mathia, 1993]. The objective has been to develop a
constructive method that uses certain a priori information about a problem domain to design the starting structure
of an artificial neural network (ANN). The method explored is based on a general systems theory methodology
(here called GSM) that calculates a kind of structural information of the problem domain by analyzing I/O pairs
from that domain. A modularized ANN structure is developed based on the GSM information provided. The
notion of the performance subset (PS) of an ANN structure is described, and extensive experiments on 3-input,
1-output Boolean mappings indicate that the resulting modularized-ANN design is 'conservative' in the sense that
the PS of the modularized ANN contains at least all the mappings included in the GSM category used to design the
ANN. Partial experiments on 5-input, 1-output Boolean functions indicate further success. The extended
experimental results also suggest the possibility of using a measure of the learning curve of specified ANNs on a
series of (in this case Boolean) functions to serve as a proxy measure for the complexity of those functions. This
proxy measure seems to correlate well with a measure known as Boolean Length. Determining a function's
Boolean Length is a non-trivial undertaking; perhaps it will turn out that training an ANN on the function and
measuring its learning experience will be a useful measure of function complexity, and easier to determine than
the function's Boolean Length.

1  Background 

In the General Systems Theory literature, there is a method we refer to as the general system method (GSM)
[Lendaris, Zwick & Mathia, 1993] which provides 'structural knowledge' about a problem via a particular kind of
(information-theoretic) analysis of data from that problem domain. A question arises as to whether that GSM
structural knowledge can be used as a priori information about the problem to assist in designing an artificial
neural network (ANN) to be applied to that problem. In the above reference, we presented the idea of using the
GSM structural knowledge to design modularized ANNs to learn the mappings implicit in such data (in our case,
I/O pairs of a Boolean mapping), and the results of some preliminary experiments. Certain predictions were made,
and verified, about the potential benefits of designing an ANN in this way.

The results reported here are based on an extensive set of experiments based on a four-variable (nominal-data)
structure. In GSM notation, this is designated an ABCD structure. In our case, due to the input/output nature of
our problem context, we impose the notion of causality, and consider this a 3-input, 1-output system, as shown
schematically in Figure 1a. The data are all binary, thus this system is mathematically expressed as a Boolean
function. The set of $2^{2^3} = 256$ possible Boolean functions (mappings) for such a system has been widely studied,
and much is known about them. In the context of elementary cellular automata (ECA), for example, the 256
mappings are grouped into 88 equivalence classes [15]. This latter knowledge was used to select functions with
known structural properties for the present ANN exploration.

The focus here is on relations of two different structural types: 1) non-decomposable, and 2) decomposable into
two relations with one shared variable. In GSM notation, type 1) is expressed as an ABCD structure, and type 2)
as an ABD:ACD structure. We define D to represent the output variable of our causal system, and the ABD:ACD
notation represents the case of A being the shared variable. Permutations of the input variables A,B,C generate a
number of different but topologically equivalent mappings. In GSM, the ABCD structure corresponds to a relation
of maximum complexity; there is no reduction available (within this framework). The ABD:ACD structure, on
the other hand, represents the case where a (partial) decoupling of variables is possible: that is, ABCD may
be decomposed into two sub-structures ABD and ACD which are not further decomposable. These two
sub-structures are said to share variable A.

Consider now the task of designing an ANN to perform a 3-input, 1-output binary mapping. If nothing were
known a priori about the specific mapping to be learned, then a typical candidate ANN structure to put into the box
of Figure 1a would be a feed-forward type, with perhaps a single hidden layer, and fully interconnected (for
present purposes, we call this a 'non-decomposed structure'). On the other hand, if a priori knowledge were
available that the mapping to be learned was of the ABD:ACD type, then an ANN structure that would take into
account such a priori information is shown in Figure 1b. In this case, we decompose the hidden layer into two
sub-structures (the shaded boxes) which are not further decomposable. The number of inputs to each sub-structure
is smaller than for the ANN of Figure 1a. (We call this a decomposed structure, or alternatively, a modularized
structure.) For a 3-input system, with only 256 total possible mappings, this may seem trivial, but even moving up
to just a 5-input system, the ANN-related implications start becoming significant. For the 5-input case, the total
number of possible maps is 2^32, i.e. on the order of billions. Even for seemingly small numbers of inputs, it is
physically not tractable to build ANNs whose performance subset PS (defined below) covers the entire set of possible
mappings, so any constraints discovered in the data that can be translated into correlated constraints on the ANN
structure would be most welcome.
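The counts quoted above follow from a short calculation: n binary inputs give 2^n input patterns, and each pattern can be assigned either output bit. A minimal sketch (the function name is ours, introduced only for illustration):

```python
def num_boolean_mappings(n_inputs, n_outputs=1):
    """Number of distinct mappings from n binary inputs to n_outputs binary outputs."""
    n_patterns = 2 ** n_inputs            # distinct input patterns
    return 2 ** (n_outputs * n_patterns)  # one free output bit per pattern and output


for n in (3, 5):
    print(n, num_boolean_mappings(n))
```

For n = 3 this gives 256, and for n = 5 it gives 2^32 = 4,294,967,296.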

2  Notation 

We start with a characterization of the ANN as a "black box" that performs a mapping of its inputs to its outputs.
Once the inputs and outputs are defined, conceptually, there exists a set of all possible mappings (SAPM) from
the input domain to the output range (e.g., for an n-binary-input, 1-binary-output context, there are 2^(2^n) possible
mappings). For each ANN structure (inside the box) with a given setting of its weight values, the ANN will
perform exactly one of these possible mappings. Doing the mental experiment of scanning all possible weight-value
combinations in the given ANN, and collecting all the individual mappings performed by the ANN, we call
the resulting collection of mappings the ANN's performance subset (PS) [8][6]. (For the binary case, if the
number of inputs to the ANN exceeds approximately 30, it would be physically impossible to build an ANN whose
PS contains all 2^(2^n) mappings, hence the name subset.)


Figure 1. a) Non-decomposed and b) decomposed 3-input, 1-output system. The shaded boxes are
implemented as separate ANNs.

Figure 2. Set of all possible mappings (SAPM) and Performance Subset (PS) (shaded areas).
The dot represents the desired mapping (DM).


In Figure 2a, we symbolize the set of all possible maps (SAPM) as the region defined by the outermost boundary,
and symbolize the PS of some given ANN structure as the region defined by the inner boundary (shaded area). In
Figure 2b, we add a point DM to represent the mapping corresponding to a problem we wish solved (Desired
Mapping). By our definition of PS as the collection of all those mappings it is possible for the given ANN
structure to perform (over the set of all its weight values), it is not possible for the given ANN structure to perform
the mapping DM shown in Figure 2b. Thus, no matter what weight-adjusting algorithm one attempts to use, it
would be impossible for that ANN structure to ever learn the mapping at DM. So what is the ANN designer to do?
Several strategies suggest themselves for what might be done before training and/or during training: 1)
"move" the point DM until it is inside the given region PS (Figure 2c), 2) "move" the region marked PS until it
contains the desired mapping DM (Figure 2d), and/or 3) increase the size of PS until it contains the point DM
(Figure 2e). Strategy 1 may be accomplished by the designer selecting a different representation schema for the
inputs and/or outputs, and de facto, is accomplished any time a designer selects a representation of the problem
such that the ANN structure the designer is working with successfully learns the desired mapping. Strategy 2 is
accomplished by selecting a different ANN structure. The authors are aware of two references describing
approaches (that appear to) use this strategy on line [5][11] (while this possible strategy was discussed even in the
19^'s, a theoretical basis for such an approach is still in its infancy). Strategy 3 is exemplified by the variety of
methods that "grow" the starting ANN structure during training. To date, most training strategies assume that DM
is already contained within the PS of the starting ANN structure (Figure 2f), and the job of the training algorithm
is to converge upon DM; indeed, typical convergence theorems state that a solution will be found provided it
exists, and using the present vocabulary, this says provided DM is contained in PS.

The present paper is concerned with the possibility of using a priori information about the problem domain to
constructively prestructure an ANN with assurance that its PS contains the desired mapping (Figure 2f), and in
addition, with assurance that the size of its PS is relatively small. The reason for the latter desire is that if a given
ANN structure learns the training data, then the smaller its PS, the better its chance for good generalization
performance [8][1]. This latter property is the objective pursued by those methods that do weight 'pruning'
[14][12]; they start with an ANN structure whose PS is large enough to assure inclusion of their DM, and then
shrink the size of the PS in a principled way, making it smaller and smaller just to the stage before it no longer
contains their DM.

3  Experiments 

A set of experiments was performed on the 256 possible 3-input, 1-output Boolean functions, using the partitioning
into 88 equivalence classes mentioned in Section 1. A consistent exemplar of each of the classes was selected,
yielding a set of 88 functions, and then 1760 experiments were run on these 88 functions [10]. Feedforward ANNs
with back-propagation-of-error training were used. All experiments with each structure type used the same
starting state, and the same training parameters. The key variable in the experiments was the mapping
to be learned. The training process was stopped at specified increments of training iterations, and the performance
of the ANN was evaluated by counting the number of bits of the output mapping that were correctly learned. The
initial experiments reported in [9] focused on just those functions of the ABD:ACD decomposable type, all of
which are of GSM structural type 4, and an equivalent number of non-decomposable functions (type ABCD),
which are of GSM structural type 6. The examples from structural type 6 were selected intuitively to correspond in
some plausible way to each of the ABD:ACD functions of type 4. Those preliminary results paved the way to the
more extensive experiments described in this paper. These experiments included examples from all six structural
types.
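The "number of bits correctly learned" score can be sketched as follows. The numbering of functions by an integer 0..255 (bit k holding the output for the k-th input pattern) and the 0.5 decision threshold are our assumptions for illustration; the paper does not specify either.

```python
def truth_table(func_id):
    """Output bits of the 3-input Boolean function numbered func_id (0..255).

    Assumed convention: bit k of func_id is the output for the k-th input
    pattern (A, B, C) taken in binary counting order.
    """
    return [(func_id >> k) & 1 for k in range(8)]


def bits_correct(outputs, func_id, threshold=0.5):
    """Count how many of the 8 network outputs fall on the correct side of the threshold."""
    target = truth_table(func_id)
    return sum(int(o > threshold) == t for o, t in zip(outputs, target))
```

A score of 8 means the whole mapping has been learned; lower scores give the steady-state level of a partially learned mapping.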

First, experiments were run to determine the size of hidden layer needed in the non-decomposed ANN structure to
learn the examples taken from the ABCD class. We settled on a fully-connected, feed-forward structure with one
hidden layer of 4 elements, and this led to a decomposed ANN structure (via removing selected connections) with
each sub-structure in the hidden layer comprising 2 elements. The conjecture was that while the non-decomposed
(more general) ANN would be able to learn all the mappings (i.e., both the ABCD types and the ABD:ACD types),
the modularized ANN would not be able to learn the ABCD mappings. Further, it was conjectured that since the
structure of the modularized ANN in some sense mirrored the known structure of the ABD:ACD mappings, the
modularized ANN would be able to learn the ABD:ACD mappings 'more easily' than the more general ANN
structure would, and (more importantly) it was conjectured that if the two structures were each trained on partial
data from an ABD:ACD mapping, then the decomposed ANN would have better generalization performance than
the more general ANN structure would (since its performance subset (PS) would be smaller).
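One way to picture "removing selected connections" is as a fixed 0/1 mask on the input-to-hidden weights. The sketch below is ours and rests on assumptions not stated in the text: the inputs are ordered (A, B, C), the first two hidden elements form the AB sub-structure and the last two the AC sub-structure, and the hidden activation is tanh.

```python
import numpy as np

# Rows: the 4 hidden elements; columns: inputs (A, B, C).
# Assumed layout: elements 0-1 see (A, B), elements 2-3 see (A, C).
MASK = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [1, 0, 1],
                 [1, 0, 1]], dtype=float)


def hidden_layer(x, W, b, mask=None):
    """Hidden activations; with a mask, removed connections contribute nothing."""
    W_eff = W if mask is None else W * mask
    return np.tanh(W_eff @ x + b)
```

With the mask, 8 of the 12 input-to-hidden weights remain, and the AB sub-structure cannot respond to C (nor the AC sub-structure to B), which is the structural restriction that shrinks the PS.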

Let us use the vocabulary introduced in Section 2 to discuss the 6 groupings of functions used in this study. As
noted earlier, GSM structural type 6 refers to the most complex case, and thus these functions are expected to
require the most general ANN structure, i.e., a fully connected one. The ABD:ACD functions used above are from
type 4, and we have noted that the modularized ANN structure of Figure 1b works for these functions. The fully
connected ANN structure selected has a performance subset (PS) that covers the entire set of 256 possible 3-input,
1-output Boolean mappings (as noted earlier, had the number of inputs been larger than approximately 30, this
would not be physically possible). The modularized ANN structure used, however, has a smaller PS. The way the
modularization was done in this case was to divide up the elements in the hidden layer equally between the two
partitions. This can be considered in the same way as we discussed weight pruning earlier to infer that the PS of the
modularized ANN is a reduced version of the more general ANN's PS (in this case, trivially so, since the larger PS
includes the entire set of possible maps). A question of basic interest in the present research is how the PS of
the modularized ANN relates to the set of mappings associated with the various GSM structural types. We know
that the PS of the modularized structure does not contain some of the mappings of type 6 (when we trained on
those mappings that intuitively were among the "hardest" of these, the modularized ANN did not learn them). We
also know that the PS of the modularized structure contains all of the mappings of type 4 (the ABD:ACD type), as
it learned all of these. Since each structural type subsumes the lower types, we expect the PS of the modularized
ANN structure designed according to structure type 4 requirements to include the mappings corresponding to the
lower types. The experiments bore this out.

In addition, however, the full set of experiments shows that the PS of the modularized ANN structure is in fact
larger than just the mappings of GSM structural type 4 (and the subsumed structure types 3, 2 & 1). The
modularized ANN was able to learn all the mappings of type 5, and further, some of the mappings of type 6. These results
indicate that if we select a modularized ANN structure based on the GSM structural type inferred via GSM analysis
of I/O pairs of data from the problem domain, then the design is "conservative" in the sense that its PS is at least
big enough to contain the mappings of the inferred GSM structural type. The fact that the PS is larger than just
the mappings contained in the inferred GSM set can be explained as follows: the mappings being explored are
Boolean, i.e., all variables are binary. While the inputs to the ANN are binary, and the output neurode learns to
give binary outputs, the hidden neurode values are not constrained (in our experiments) to binary values.
Accordingly, it is clear that the PS of the ANN prestructuring selected will be larger than the set of mappings of the
inferred GSM set. For the 3-input case the size of the ANN's PS reached up into structural type 6 (not all of it
though). For a number of inputs n larger than the 3 here, the number of GSM structural types will be
significantly larger than the 6 associated with the n=3 case. Our analysis so far gives us hints suggesting the
following speculation: the PS of the ANN designed via the GSM structural information (i.e. the selected structure
in the GSM lattice) will contain functions within the "neighborhood" of the GSM structure identified. The term
neighborhood here is intended to mean within approximately 2 levels further up from the one selected in the GSM
lattice of structures. In the 4-variable case (3 inputs, 1 output), since the GSM lattice of structures is so small, the
"neighborhood" reached into the top level of the lattice. However, for larger values of n, the GSM lattice will have
significantly more levels, so the "neighborhood" could be a rather small fraction of the total range. Accordingly,
the relative size of the prestructured ANN's PS will be a small fraction of the size of the collection of mappings up
to the top of the lattice, and thus the difference between a general ANN's PS and that of a prestructured ANN will
be greater, and thus the prestructuring will pay even better dividends than those already discussed for the n=3 case.
We believe that the principle has been demonstrated, but there remains yet a significant amount of work to analyze
even the 5-input case.

We pause here to mention that all the 3-input, 1-output experiments were carried out with a full training set, where
the research objective was to observe the learning process. In these cases, the question of generalization was not at
issue. But, when we do move on to consider the generalization question, the kind of knowledge available about the
functions we are exploring is very useful. Since there is a definite (known) structure for the functions being
learned, we are in a position to constructively design a subset of the possible input patterns for training the ANN
that theoretically contains enough information to infer the entire mapping. After the ANN trains on this subset, to
the extent that its structure really is tailored to the structure of the desired mapping, we should expect the
ANN to generalize well. The better the structural match, the closer to 100% generalization. To carry out a
preliminary experiment related to generalization, a set of four 5-input, 1-output functions, decomposable in the way
indicated in Figure 3, were crafted. This selection was made because the 3-input case used for the rest of the
experiments was judged too limiting for carrying out the desired generalization experiment. By construction of
these functions, we were able to select a training set comprising only 50% of the possible input patterns (i.e., just
half the mapping) which we knew theoretically contained sufficient information from which to infer the total
mapping. This set was used to train both a general (fully connected) 5-input ANN and a modularized 5-input
ANN (cf. Figure 3). For the decomposed functions, both structures learned the training set perfectly. However,
there was a big difference in the generalization tests: 1) the modularized ANN gave perfect responses for the I/O
pairs not seen during training, and 2) the general ANN averaged only 55% correct responses (nearly random) on
these test inputs. Also, four non-decomposable functions were selected, and both ANN structures were trained on them,
again using 50% of the possible input patterns. The general ANN learned the training set, while the modularized
ANN did poorly. The modularized ANN did poorly at generalizing, and so did the general ANN. While these
experiments used but a small fraction of the possible mappings in the 5-input, 1-output context, the experimental
procedure of constructing a focused experiment and having this give results which support the hypothesis carries
reasonable convincing power, especially since it was constructed such that if the experiment gave negative results
(a counter-example), it would have significantly undermined the basic premise of the approach.

Figure 3. Modularized ANN, implementing a 5-input, 1-output ABCF:CDEF system. Used for generalization test.


4  Proxy  Measures  for  Complexity? 

To recapitulate, the assumption is that we start with a set of I/O data for the problem domain, specifically here,
binary I/O data. Next, we submit this data to what we are calling GSM structural analysis, and this analysis
assigns a structural type to the data (ranging from type 6 to type 1 for the 4-variable case). This structure number
is a kind of complexity measure (6 being the most complex), relating to the decomposability of the Boolean
function inferred to underlie the given data. This yields a rather coarse coding of the possible 256 functions being
considered, and as might be expected, within each category, there will be a gradation of the degree of complexity,
if we had a finer way to measure it. Nevertheless, the premise here is that even this rather coarse measure can be
put to good use in modularizing ANNs, where 'good' here relates to improved learning speed, and more
importantly, improved generalization performance.

The 88 functions studied in these experiments were sorted according to the 'learning difficulty' experienced by
1) the fully-interconnected ANN and 2) the modularized ANN. In addition, the same functions were sorted according
to four different measures in the literature dealing with Boolean functions, namely, the already discussed GSM
structural type (range 1-6) [3], and in addition the Lambda-count (range 0-4) [4], Fluency (range 1-9) (theory
developed in [15], original idea by Ashby [13]), and Boolean Length (range 1-10), which is the minimum number
of Boolean operations necessary to represent the binary string (related to Kolmogorov complexity) [15]. Values of
these four complexity measures for the 88 Boolean functions selected for our experiments appear in [25]. In
comparing these sorted lists, except at the two ends (where all but Lambda basically concurred), it turned out that there
was little, if any, correlation between the orderings given by the 4 published measures of complexity. However,
there was a good correlation in the orderings provided by the Boolean Length measure and the 'learning difficulty'
assigned to each of the ANN structures investigated. The suggestion based on these observations is that certain
measures of the learning curve of specified ANN structures might be used as a proxy measure for the complexity of
certain classes of functions [10]. The learning curve may be characterized by (at least) two of its attributes: the
transient portion and the steady state level. Here the steady state level can be characterized by the number of bits
learned (max. of 8 for the 3-input, 1-output case), the transient by the number of training cycles required by the
ANN structure to accomplish the learning. Other qualities of the transient suggest themselves via visual analysis,
but have not yet been reduced to quantitative expressions.

While  the  investigation  here  was  based  upon  Boolean  functions  that  have  well  documented  properties  in  the 
literature,  there  might  be  a  basis  for  developing  this  approach  to  other  classes  of  functions.  Thus  we  have  a  turn  of 
events.  Instead  of  lamenting  the  difficulty  an  ANN  has  in  learning  a  given  task,  we  might  be  able  to  use  the 
learning  experience  of  specified  ANN  structures  as  proxy  measures  for  the  complexity  of  certain  classes  of 
functions

Abstract

The tradeoff between output error minimization and generalization is discussed, and a new
learning algorithm is developed to improve generalization and robust classification capability for
the multi-layer Perceptron. Unlike other methods which reduce network complexity by putting
restrictions on synaptic weights, this algorithm increases complexity of the underlying problem
by imposing appropriate additional requirements on the hidden-layer neurons, i.e. low output
sensitivity to the input values or equivalently saturation of hidden-layer activations. The
additional gradient-descent term turns out to be Hebbian, and this new algorithm incorporates
both the error back-propagation and Hebbian learning rules. The algorithm also utilizes the full
power of existing hardware to find the solution with maximum generalization capability. Computer
simulation demonstrates much faster learning convergence as well as improved robustness for
classification and hetero-association problems.

1.  Introduction 

Although the multi-layer Perceptron is capable of solving complicated pattern classification and
functional approximation problems, good generalization is achievable only with a proper combination
of the training data size, the underlying problem complexity, and the network complexity (Hush
and Horne 1993). For a given number of training data and problem complexity, smaller networks
are not capable of representing the problem accurately. On the other hand, larger networks with
too many synaptic weights suffer from poor generalization. To improve the generalization
capability in this case, one may put restrictions on the synaptic weights and reduce the network
complexity. Synaptic weight pruning (LeCun et al. 1990), local connections and weight sharing
(Fukushima 1968, 1993; Waibel et al. 1989; LeCun et al. 1990), weight decay (Hanson and Pratt
1989), weight elimination (Weigend et al. 1990), and soft weight sharing (Nowlan and Hinton
1992) all belong to this approach. However, for many practical hardware implementations the
network complexity is already determined by the hardware itself and one would like to fully
utilize existing hardware capability. Also it is not easy to modify the network architecture in
hardware implementations.

In this paper we propose the other approach, i.e. to increase the underlying problem complexity.
By properly imposing additional requirements for low sensitivity on the hidden-layer neurons, the
problem complexity is increased to result in better generalization. The neural network
architecture and major parameters, e.g. numbers of synapses and hidden-layer neurons, are
unchanged during learning. The learning algorithm just utilizes the maximum power of the
hardware available. Also, the gradient-descent learning algorithm happens to incorporate both
the error back-propagation and Hebbian learning rules.

2. Saturation Requirements on Hidden-layer Neurons

One of the most frequent symptoms of poor generalization with overfitting is erroneously high
sensitivity of output values to input values. For good generalization and robust classification we
would like to reduce the unreasonably high sensitivity. By applying the chain rule one obtains the
sensitivity as

$$ \frac{\partial y_i}{\partial x_k} = \sum_j W^{(2)}_{ij} \, f'(h_j) \, W^{(1)}_{jk} , \qquad (1) $$

where x and y are the input and output vectors, respectively, and $W^{(1)}_{jk}$ and $W^{(2)}_{ij}$ are synaptic
interconnections. Although our approach can be extended to general multi-layer architecture,
only the 2-layer Perceptron with linear output neurons is considered here for simplicity. The f'(.)
is the derivative of the Sigmoid nonlinear function at the hidden-layer neurons, and
$h_j = \sum_k W^{(1)}_{jk} x_k$ is the post-synaptic neural activation. From Eq.(1) the sensitivity can be made
smaller by forcing the hidden-layer activations to be saturated, i.e. $f'(h_j) \approx 0$, and this additional
requirement increases the problem complexity for better generalization. Instead of the standard output error we
define a new error as

$$ E = E_o + \gamma E_h = \sum_s E_o^s + \gamma \sum_s E_h^s , \qquad (2) $$

where $E_o^s = \frac{1}{2 M N_o} \sum_i (t_i^s - y_i^s)^2$ is the output error and
$E_h^s = \frac{1}{M N_h} \sum_j f'(h_j^s)$ is the additional error defined from the hidden-layer
neural activations. The $t_i^s$ and $y_i^s$ are the target and output
values of the ith output neuron for the sth stored pattern, and $h_j^s$ is the corresponding
post-synaptic value for the jth hidden-layer neuron. Here M, $N_o$, and $N_h$ are the number of
stored patterns, number of output neurons, and number of hidden-layer neurons, respectively, and
the errors are normalized with these numbers. If the neural activation of the hidden layer stays
in the linear region of the Sigmoid function, it becomes sensitive to the input noise and a high
hidden-layer error $E_h^s$ is assigned. (Up to the normalization factor, the above definition of $E_h^s$ results in
$\frac{1}{2} \sum_j [1 - f^2(h_j^s)]$ for the bipolar hyperbolic tangent Sigmoid function and
$\sum_j f(h_j^s) [1 - f(h_j^s)]$ for the unipolar
Sigmoid function.) It is worth noting that both the output error $E_o$ and hidden-layer error $E_h$
are normalized to take similar values, i.e. around 0.5 for bipolar Sigmoid and 0.25 for unipolar
Sigmoid, for very small initial synapses and converge to 0 by training. By minimizing the $E_h$
one may push the network into the nonlinear region for improved robustness. The γ represents the
relative significance of the hidden-layer error $E_h$ over the output error $E_o$.
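The link between saturation and low sensitivity can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a 2-layer network with linear outputs, no bias terms, and the bipolar Sigmoid f(h) = tanh(h/2) (chosen so that f'(0) = 0.5); all function names are ours.

```python
import numpy as np


def f(h):
    """Bipolar Sigmoid, assumed here to be tanh(h/2)."""
    return np.tanh(0.5 * h)


def f_prime(h):
    """Derivative of f; it approaches 0 as the neuron saturates."""
    return 0.5 * (1.0 - f(h) ** 2)


def forward(W1, W2, x):
    """Linear output layer on top of Sigmoid hidden neurons."""
    return W2 @ f(W1 @ x)


def sensitivity(W1, W2, x):
    """Jacobian dy_i/dx_k = sum_j W2[i, j] * f'(h_j) * W1[j, k]."""
    h = W1 @ x
    return W2 @ (f_prime(h)[:, None] * W1)


W1 = np.array([[1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[1.0, -1.0]])
x = np.array([1.0, 0.5])
print(np.abs(sensitivity(W1, W2, x)).max())         # hidden neurons near the linear region
print(np.abs(sensitivity(20.0 * W1, W2, x)).max())  # hidden neurons driven into saturation
```

In this example, scaling the first-layer weights by 20 pushes both hidden neurons into saturation, and the largest entry of the Jacobian drops by more than two orders of magnitude even though the weights themselves are larger.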

The network is trained by the steepest-descent error minimization algorithm as usual. Although the
last layer is not affected by this additional error term, the partial derivative of the total error E with
respect to each weight in the first layer now has an additional term and the weight update
becomes
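$$ \Delta W^{(1)}_{jk} = -\eta \frac{\partial E}{\partial W^{(1)}_{jk}} = \eta \sum_s \left[ \delta_j^s - \frac{\gamma}{M N_h} f''(h_j^s) \right] x_k^s , \qquad (3) $$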


where the $\delta_j^s$ is the back-propagated error on the jth hidden-layer neuron for the sth training
pattern as usual. The η is a learning coefficient, and f''(.) is the second derivative of the
hyperbolic tangent Sigmoid function f(.). The additional term in Eq.(3) happens to follow the
Hebbian learning rule, and this new learning algorithm incorporates two popular learning
algorithms, i.e. the error back-propagation and Hebbian learning rules. It is worth noting that
the Hebbian term is also multiplied by $f'(h_j^s)$, which prevents the synaptic weights from
indefinite increase or decrease.

With very small initial synaptic values the hidden-layer activations $f(h_j^s)$ are close to 0 and the Hebbian
term does not contribute much. As the learning process goes on by the first error back-propagation term only, the
hidden-layer activation values become non-trivial and the Hebbian term starts contributing to push
hidden-layer activations to the saturation regions.
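The behaviour described above can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes the total error E = Eo + γ Eh with Eo the squared output error normalized by 2·M·No and Eh the sum of f'(h) normalized by M·Nh, the bipolar Sigmoid f(h) = tanh(h/2), linear outputs, and no bias terms. All names are hypothetical.

```python
import numpy as np


def f(h):
    return np.tanh(0.5 * h)        # assumed bipolar Sigmoid


def f_prime(h):
    return 0.5 * (1.0 - f(h) ** 2)


def f_second(h):
    return -f(h) * f_prime(h)      # second derivative of tanh(h/2)


def errors(W1, W2, X, T):
    """Return (Eo, Eh) for patterns X (M x Nin) and targets T (M x No)."""
    M, Nh, No = X.shape[0], W1.shape[0], W2.shape[0]
    H = X @ W1.T                   # post-synaptic values, shape (M, Nh)
    Y = f(H) @ W2.T                # linear outputs, shape (M, No)
    Eo = np.sum((T - Y) ** 2) / (2.0 * M * No)
    Eh = np.sum(f_prime(H)) / (M * Nh)
    return Eo, Eh


def first_layer_update(W1, W2, X, T, gamma, eta):
    """Gradient-descent step on E = Eo + gamma * Eh for the first layer."""
    M, Nh, No = X.shape[0], W1.shape[0], W2.shape[0]
    H = X @ W1.T
    Y = f(H) @ W2.T
    delta = ((T - Y) @ W2) * f_prime(H) / (M * No)  # back-propagated error term
    hebb = -gamma * f_second(H) / (M * Nh)          # equals gamma * f(h) * f'(h) / (M * Nh)
    return eta * (delta + hebb).T @ X               # shape (Nh, Nin)
```

The `hebb` term is a product of the post-synaptic activation f(h), the factor f'(h), and the pre-synaptic input, so it vanishes both for near-zero activations (small initial weights) and for fully saturated neurons. With very small weights and bipolar targets, both `Eo` and `Eh` evaluate to about 0.5, consistent with the normalization described in the text.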


Although the author's previous work to merge Hebbian learning and error back-propagation (Kdi
et al. 1900) was based on correlation matrix synapses, this new algorithm represents more
general adaptive Hebbian learning. Actually it shares more with Radial Basis Function networks
(Hartman et al. 1990; Lee and Kil 1991). Both networks incorporate competitive Hebbian learning
for the first layer and error minimization for the second layer. However, instead of the localized
Gaussian nonlinearity used for the Radial Basis Function network, our network utilizes a
monotonically increasing global Sigmoid nonlinearity on the hidden-layer neurons, which allows
the gradient-descent algorithm throughout the whole network with fewer problems. Also, synaptic
weights for both the first layer and second layer are adapted simultaneously here.

3. Simulation Results

The proposed learning algorithm is applied to classification and hetero-association problems.
Results for classification of 10 binary patterns are shown in Figures 1 and 2. The numbers of
input, hidden, and output layer neurons are set to 35, 30, and 10, respectively. In Fig.1 learning
convergences are shown in log-log scale. Fig.1(a) shows the total errors as functions of
learning epoch for different γ, i.e. relative significance of the hidden-layer error Eh over output
error Eo. The new algorithm converges much faster than standard error back-propagation (γ = 0),
and the trained results are not sensitive to γ. As shown in Figs.1(b) and (c), the output
error Eo drops very rapidly after the initial learning stage (around 10 epochs in this case), and
becomes much smaller than γ Eh thereafter. This is the point when the Hebbian learning term
in Eq.(3) starts contributing. The dominance of the hidden-layer error at the later stage
explains the insensitivity of this algorithm to the γ.

Error correction probabilities after learning are plotted as functions of Hamming distances in
Figs. 2 and 3. In Fig.2 results of the same classification problem are plotted, while the
hetero-association problem is shown in Fig.3. Both input and output layers have 35 binary neurons,
and 30 hidden-layer neurons are used in the hetero-association problem. At each Hamming
distance, i.e. number of different bits with a stored pattern, 100 test patterns are randomly
generated to satisfy the Hamming distance with each of the 10 stored patterns, and their overall
performances are collected for the 1000 test patterns. In Figs.2(a) and 3(a) the adaptive
training was stopped at E = 0.03, while the E was further reduced to 0.001 in Figs.2(b) and 3(b).
When the learning was not complete the correct classification probability is greatly affected by
the γ as shown in Figs.2(a) and 3(a). However, when enough learning was performed in
Figs.2(b) and 3(b), the correct classification or association probability becomes much higher and
also insensitive to the γ. These figures clearly demonstrate much more robust classification and
hetero-association capability of the proposed algorithm compared to the standard error
back-propagation algorithm (γ = 0).

The proposed hybrid learning algorithm is being implemented by a modular neuro-chip set.
The chip-set consists of one synapse chip and one soma chip. Each synapse cell is comprised
of one capacitor for weight storage and 3 analog multipliers: one for feed-forward signal flow,
another for error back-propagation, and the other for weight-update outer-products. In the
soma chip the sigmoid and its derivatives are calculated with current sum. Also, the
back-propagated error can be scaled up with neural activations for the hybrid learning.
These chip sets are designed to support modular concepts for generic neuro-computers with
on-chip learning capability.

4. Conclusion

In this paper we have presented a new supervised learning algorithm for 2-layer feed-forward
neural networks for robust classification and hetero-association problems. By forcing the
hidden-layer neurons to saturate, we are able to increase problem complexity and improve
generalization capability. Also, the gradient-descent algorithm turns out to incorporate a
Hebbian learning rule, and converges much faster than the standard back-propagation algorithm.
Extension to general multi-layer architectures seems straightforward. A modular neuro-chip to
incorporate this learning algorithm is being developed. Also, adaptive modification of γ, the
relative significance of the hidden-layer error Eh over the output error Eo, is currently being
investigated. Starting with an arbitrary γ, one goes on with the learning process. After reaching
the steady and slow error reduction stage, the output error may start increasing again due to the
relatively high γ value.

Fig. 1 Learning characteristics for a classification problem. (a) Total error vs. learning epoch for
classification of 10 random patterns. Here, 'o', 'a', and 'x' denote cases for γ = 0, 0.1, and 0.5,
respectively. (b) Error vs. learning epoch for γ = 0.1. (c) Error vs. learning epoch for γ = 0.5.
Here, 'o', 'a', and 'x' denote the total error E, output error Eo, and hidden-layer error γEh,
respectively.


Fig. 2 Correct classification probabilities as functions of Hamming distance for a classifier
problem, (a) at total error E = 0.03 and (b) at E = 0.0001. Here, 'o', 'a', and 'x' denote
cases with γ = 0, 0.1, and 0.5, respectively. The γ = 0 case corresponds to the standard error
back-propagation learning rule.


[Figure panels (a) and (b); horizontal axes: Hamming Distance]

Fig. 3 Correct association probabilities as functions of Hamming distance for a hetero-association
problem, (a) at total error E = 0.03 and (b) at E = 0.0001. Here, 'o', 'a', and 'x' denote
cases with γ = 0, 0.1, and 0.5, respectively. The γ = 0 case corresponds to the standard error
back-propagation learning rule.

This means the network complexity is not large enough both to minimize the output error and to
saturate the hidden-layer activations. In this case γ can be made smaller to keep the
output error going down.

Abstract

The Backpropagation technique for learning internal representations in multi-layer neural
networks is an effective approach for solution of the gradient descent problem. However, being a
deterministic solution, it will attempt to take the best path to the nearest error minimum, whether global or
local. If a local minimum is reached, the network fails to converge and either will not learn or will learn a
poor approximation of the solution.

This paper introduces a novel stochastic approach to the Backpropagation model based on
Simulated Annealing. The model is designed to provide an effective means of escape from local minima.
This technique augments the traditional gradient descent learning scheme with a Metropolis loop. This
extension of the algorithm is shown to be modeled as a Markov chain consisting of a Markov neighborhood,
a selection function, an acceptance function, and a cooling schedule. The Markov neighborhood is defined
as the possible next states of the network and is a combination of the gradient descent weight deltas and a
noise injection method. The selection function provides for the probabilistic selection of the proposed next
state of the network. Several alternative noise injection methods and selection functions are presented
complete with experimental data. The acceptance function determines the probability of acceptance of a
new network state. The acceptance function for this model is based on the total network error. Simulated
Annealing is highly dependent upon cooling schedules, and several alternative cooling schedules are
presented. Both static and dynamic approaches to cooling are examined. The prevention of premature
quenching is addressed and the selection of desirable quenching conditions is defined.

In experimental results the system is shown to converge with a much higher degree of reliability.
It is also shown to converge more reliably and much faster than traditional noise insertion techniques. Due
to the characteristics of the cooling schedule, the system also demonstrates a more consistent training
profile.

Abstract

Backpropagation is a powerful learning algorithm, but its restriction to continuously differentiable
functions limits its use in many applications. One important example is training a multi-layered neural
network of hard-limiting units (signums instead of sigmoids). Another example is a control system that
uses discrete-level actuators, such as our free-flying space robot model equipped with on-off gas thrusters.

A  new  technique  is  presented  that  extends  backpropagation  to  allow  for  discrete-valued  functions. 

Each signum that exists at run-time is temporarily replaced with a sigmoid during training, and noise is
injected  at  the  input  to  the  sigmoid.  The  noise  prevents  the  use  of  the  smooth  transition  region  of  the 
sigmoid  as  the  primary  means  of  solution.  The  effect  is  that  the  sigmoid  outputs  are  close  to  hard-limited 
during  training  so  there  is  not  a  significant  performance  reduction  when  the  signums  are  re-introduced 
at  run-time.  The  use  of  differentiable  approximating  functions  allows  fast  learning  due  to  gradient-based 
optimisation.  The  noise  does  not  corrupt  the  gradient  estimation  algorithm,  so  no  modifications  are 
needed  on  the  backward  error  propagation. 

The  viability  of  this  method  is  verified  by  applying  it  to  the  training  of  networks  with  hard-limiting 
units  as  well  as  a  complex  on-off  thruster  control  problem  associated  with  our  free-flying  space  robot. 

1  Introduction 

1.1 Optimization with discrete-valued functions

Optimization  methods  that  use  gradient  information  often  converge  much  faster  than  those  that  do  not.  Use 
of  the  backpropagation  algorithm  (BP)  [1][2]  to  get  this  gradient  information  for  training  neural  networks 
(NNs)  has  made  them  useful  in  many  applications;  however,  BP’s  requirement  of  continuous  differentiability, 
not only for the network itself, but for anything that the error is backpropagated through (e.g. the plant
model  in  a  control  problem),  limits  its  applicability. 

This is a significant limitation since there are many applications where discrete-valued states arise. For
example: on-off thrusters commonly used in spacecraft; other systems with discrete-valued inputs and outputs;
and neural networks built with signums (aka hard-limiters or Heaviside step functions) rather than
sigmoids. Signum networks may be preferred to sigmoidal ones due to hardware considerations.

In  cases  like  these,  one  choice  is  to  use  an  alternative  method  not  restricted  to  continuously  differentiable 
functions,  such  as  unsupervised  learning,  simulated  annealing,  or  a  genetic  algorithm,  but  these  are  usually 
significantly  slower  to  train,  because  they  do  not  use  gradient  information. 

1.2  Related  research 

Learning algorithms for single-layer networks date back to 1960, with Widrow’s ADALINE [3] and
Rosenblatt’s Perceptron [4]. Unfortunately, neither of these methods extends directly to multiple layers.

MADALINE  Rule  I  was  a  two-layer  network  (one  hidden  layer)  that  had  a  trainable  first  layer,  but 
the  second  layer  was  a  fixed  logic  operation,  such  as  OR,  AND,  or  MAJ  (majority)  [5].  In  MADALINE 
Rule  II,  Winter  [6]  used  a  heuristic  approach  which  had  limited  success  at  training  a  two-layer  network 
of  hard-limiters  (ADALINEs).  These  methods  may  be  classified  as  “error-correction  rules”  rather  than 
“steepest-descent  rules”  (gradient-based)  [3]. 

In recent research aimed at using gradient-based learning for multi-layer signum networks, Bartlett and
Downs [7] use weights that are random variables, and develop a training algorithm based on the fact that
the resulting probability distribution is continuously differentiable. The algorithm is limited to one hidden
layer, requires all inputs to be 1 or -1, and needs extra computation to estimate the gradient.

Another method is to approximate the discrete-valued functions with linear functions or smooth sigmoids
during  the  learning  phase,  and  switch  to  the  true  discontinuous  functions  at  run-time.  This  is  similar  to 
the  original  ADALINE,  where  the  neuron  was  trained  on  its  linear  output,  but  in  operation,  this  output 
passed  through  a  signum  function  [3].  This  method  may  work  in  cases  where  the  behavior  of  the  system 
with  sigmoids  is  close  enough  to  that  of  the  real  system;  however,  this  assumption  is  very  often  unreliable. 

1.3  Outline  of  paper 

There  are  three  major  sections  in  this  paper.  In  Section  2,  the  technique  of  learning  with  noisy  sigmoids  is 
explained  and  the  training  algorithm  is  derived.  In  Section  3,  to  demonstrate  the  usefulness  of  the  method 
for  training  multi-layered  networks  of  hard-limiters,  it  is  applied  to  two  different  problems.  In  Section  4,  to 
demonstrate  application  to  a  complex  control  problem,  it  is  applied  to  a  thruster-mapping  problem  involving 
eight  on-off  thrusters  controlling  a  3-dof  free-flying  space  robot. 


2  Backpropagation  Learning  with  Noisy  Sigmoids 

2.1  Training  algorithm 

We  introduce  the  method  of  noise  injection  by  applying  it  to  the  training  of  a  single  hard-limiting  neuron, 
as  shown  in  Figure  1.  This  neuron  could  be  trained  with  the  ADALINE  or  perceptron  learning  rules,  but 
those  methods  do  not  extend  to  multiple  layers,  while  this  one  does. 


[Figure 1 panels: RUN-TIME; TRAINING, Forward Sweep; TRAINING, Backward Sweep. Labels: net = X^T W, X = [1 x_1 x_2 ... x_n]^T, W = [b w_1 w_2 ... w_n]^T]


Figure  1:  Training  Algorithm 

During  training,  replace  discontinuous  signums  with  sigmoids,  and  inject  noise  before  the  sigmoid 
on  the  forward  sweep.  The  backward  sweep  calculation  is  the  same  as  standard  backpropagation. 


The first block diagram in Figure 1 shows the neuron as it appears at run-time, a dot-product and
hard-limiter. For simplicity in bookkeeping, the input, X, and weight, W, vectors are augmented to include
the threshold bias for the output function. The next two diagrams show the neuron during training, where
the signum has been replaced by a smooth sigmoid function. The input, X, is propagated through the
forward sweep, finally resulting in an error, e, and a cost. The derivative of this cost is calculated and
propagated through the backward sweep, resulting in a ∂cost/∂X to be propagated to more units upstream,
and a ∂cost/∂net to be used in calculating ∂cost/∂W, which is used in the weight-update algorithm.

This is almost the same as training a standard neuron with backpropagation - the only difference involves
the injection of zero-mean noise, N, immediately before the sigmoid. While the mechanics of the backward
sweep are identical, different weight updates result because the forward sweep resulted in a different error.

Note that the noise injection does not corrupt the calculation of ∂cost/∂W (just as the desired signal
does not). Using an unmodified backward sweep is not only the simplest thing to do, it does precisely the
right calculations for estimating the weight gradient.

What makes this method useful is that as the noise level increases to cover the sigmoid’s transition region,
adaptation with the resulting ∂cost/∂W leads to a set of weights that work well for the signum network.

To  summarize,  the  training  algorithm  is: 


•  Replace the hard-limiters with sigmoids during training

•  Inject noise immediately before the sigmoids on the forward sweep

•  Use the exact same backward sweep as with standard backpropagation
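
The three steps above can be sketched for a single neuron as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid is taken to be tanh, the noise is assumed Gaussian, the cost is assumed to be half the squared error, and the names train_noisy_neuron and run_signum, the learning rate, and the epoch count are hypothetical choices made for this example.

```python
import math
import random


def train_noisy_neuron(samples, noise_std=0.5, lr=0.05, epochs=300, seed=0):
    """Train one neuron with a noisy tanh sigmoid and return its weights.

    samples: list of (inputs, target) pairs with target in {-1, +1}.
    The weight vector is augmented, so weights[0] is the bias.
    """
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for inputs, target in samples:
            x = [1.0] + list(inputs)                    # augmented input vector
            net = sum(w * xi for w, xi in zip(weights, x))
            noise = rng.gauss(0.0, noise_std)           # zero-mean noise (assumed Gaussian)
            out = math.tanh(net + noise)                # forward sweep: noise enters before the sigmoid
            err = target - out
            # Backward sweep, unchanged from standard backpropagation
            # for cost = err**2 / 2 and a tanh output unit.
            dcost_dnet = -err * (1.0 - out * out)
            weights = [w - lr * dcost_dnet * xi for w, xi in zip(weights, x)]
    return weights


def run_signum(weights, inputs):
    """Run-time neuron: dot product followed by a hard limiter."""
    net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], inputs))
    return 1.0 if net >= 0.0 else -1.0
```

After training, the same weights are used unchanged in run_signum; the noise exists only on the training-time forward sweep.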
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2.2  Why  it  works 

Without  addition  of  noise,  the  network  may  train  using  values  in  the  sigmoid  transition  region  (roughly  -0.8 
to  0.8)  that  will  be  unavailable  at  run-time.  Simply  rounding  off  at  run-time  may  introduce  significant  errors. 
The goal of noise injection is to move neuron activations away from the transition region during training, so
round-off  error  will  be  small  when  the  discontinuous  functions  are  replaced. 

An  intuitive  reason  for  adding  the  noise  is  to  throw  the  neuron  off  its  transition  region,  and  effectively 
force it to hard-limit at the high or low value. Figure 2 shows how the neuron output distribution changes as
the  noise  level  increases.  With  no  noise,  only  a  single  output  can  result,  but  as  noise  increases  to  cover  most 
of  the  transition  region,  the  output  distribution  approaches  that  of  a  hard-limiting  function.  Differentiability 
is  maintained,  however,  so  gradient  information  will  be  available  to  speed  up  learning.  Since  the  noise  has 
pushed  the  distribution  to  approximate  a  hard-limiting  non-linearity,  when  the  hard-limiter  is  re-introduced 
at  run-time  the  performance  degradation  will  be  small. 


[Figure 2 panels: Noise = 0.2; Noise = 0.5; Noise = 1.5; Noise = 3; Input, Noise = 3; Output, Noise = 3]


Figure 2: Effect of Input Noise Level on Sigmoid Output Distribution
Lightly-shaded region represents the sigmoid input probability distribution (in this case, -0.3 +
noise). Darkly-shaded region is the sigmoid output distribution (from -1 to 1), plotted horizontally
to  correspond  to  the  sigmoid  plot.  Each  distribution  has  an  area  of  1.  As  noise  level  increases,  and 
the  input  distribution  spreads  out,  the  sigmoid  output  approaches  that  of  a  hard-limiter,  while 
remaining  differentiable.  At  right,  input  and  output  distributions  are  plotted  separately. 
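
The trend described in this section can be checked numerically with a small Monte Carlo sketch. It is only an illustration under assumptions not stated in the paper: the sigmoid is taken to be tanh, the noise is assumed Gaussian with the noise level as its standard deviation, and the function name saturated_fraction and the 0.8 threshold (borrowed from the "roughly -0.8 to 0.8" transition region mentioned above) are choices made here.

```python
import math
import random


def saturated_fraction(mean_input, noise_std, threshold=0.8, n=20000, seed=0):
    """Fraction of tanh(mean_input + noise) outputs with |output| > threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        out = math.tanh(mean_input + rng.gauss(0.0, noise_std))
        if abs(out) > threshold:
            hits += 1
    return hits / n
```

Calling saturated_fraction(-0.3, level) for the noise levels 0.2, 0.5, 1.5, and 3 used in Figure 2 should show the share of outputs outside the transition region growing with the noise level, which is the sense in which the noisy sigmoid output approaches that of a hard-limiter.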

2.3  Extensions,  application  considerations 

2.3.1  Selection  of  noise  level 

One  concern  is  the  attenuating  effect  of  the  derivative-of-sigmoid  function.  When  back-propagated  through 
many  layers  of  near-saturated  sigmoids,  the  error  signal  is  attenuated  and  may  lead  to  slow  learning.  To 
handle  this  problem,  it  may  be  necessary  to  be  gradual  in  increasing  the  noise  level  -  slowly  push  the  outputs 
from  the  linear  region  to  the  hard-limits,  rather  than  all  at  once.  However,  since  all  the  experiments  presented 
here  had  a  single  layer  of  discontinuity,  no  such  gradual  increase  was  required. 

For  training  networks  with  simple  bi-level  sigmoids,  once  the  noise  reached  a  sufficient  level  (roughly  0.5 
and  3  in  two  different  applications),  there  was  no  degradation  if  it  were  increased  beyond  that  level.  The 
only  possible  drawback  is  the  attenuation  effect  mentioned  above.  The  required  noise  level  varies  in  different 
applications  depending  upon  how  sharp  the  decision  boundaries  would  be  with  no  noise  (i.e.  if  it’s  a  sharp 
sigmoid  to  begin  with,  not  much  noise  is  needed  to  force  it  off  the  transition  region). 

When  multi-level  sigmoids  are  used,  as  seen  in  Figure  9,  there  is  an  upper  limit  to  noise  level.  Too  much 
noise  may  cause  the  individual  sigmoids  to  overlap,  which  in  this  example  would  blur  out  the  middle  level. 

2.3.2  Discrete-valued  functions  other  than  bi-level  signums 

If adapting a system that contains discrete-valued functions that are not simple Heaviside step functions, the
method  may  work  if  a  continuously  differentiable  approximating  function  is  used.  For  example,  a  function 
whose  output  can  take  on  multiple  discrete  values  may  be  approximated  by  combining  multiple  sigmoid 
functions.  For  the  thruster  mapping  problem  described  in  Section  4,  the  thruster  can  take  on  three  states: 
forward,  off,  or  backward.  Two  bi-level  (-1,1)  sigmoids  were  summed  to  produce  a  tri-level  (-1,0,1)  sigmoid. 
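
One way to build such a tri-level function is sketched below. The construction is an assumption made for illustration: the bi-level sigmoids are taken to be tanh, the switching offset of 0.5 is hypothetical, and the sum is scaled by 1/2 so that the saturated levels come out as -1, 0, and 1 (the paper does not spell out its scaling). The sharpness default of 4 is the value quoted in the Figure 9 caption.

```python
import math


def tri_level_sigmoid(x, sharpness=4.0, offset=0.5):
    """Smooth approximation of a three-state (-1, 0, 1) thruster command."""
    lower = math.tanh(sharpness * (x + offset))   # switches near x = -offset
    upper = math.tanh(sharpness * (x - offset))   # switches near x = +offset
    return 0.5 * (lower + upper)                  # halved sum (assumed scaling)


def tri_level_signum(x, offset=0.5):
    """Run-time counterpart: backward (-1), off (0), or forward (+1)."""
    if x > offset:
        return 1.0
    if x < -offset:
        return -1.0
    return 0.0
```

During training the smooth function would be used with noise added to x; at run-time it would be replaced by tri_level_signum.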

2.3.3  Batch-learning 

The  randomness  introduced  with  the  addition  of  noise  could  make  learning  slow  because  of  the  reduction 
in  signal-to-noise  ratio  in  the  weight  gradient  estimation.  Batch-learning,  using  the  exact  same  training 
set  from  one  epoch  to  the  next  worked  well  (considering  the  “training  set”  to  include  the  “input  set”  and 
“noise  set”).  Freezing  the  training  set  and  noise  set  defines  a  fixed  deterministic  cost  hyper-surface.  With  a 
fixed  cost  function,  on-line  tuning  of  momentum  and  learning  rate  can  be  applied  to  dramatically  improve 
convergence  rate. 
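
The idea of freezing the noise set can be sketched as follows for a single noisy-tanh neuron. This is an illustrative sketch only: the Gaussian noise, the tanh sigmoid, the mean-squared-error cost, and the names make_noise_set and batch_cost are assumptions introduced here.

```python
import math
import random


def make_noise_set(n_samples, noise_std, seed=0):
    """Draw one noise value per training sample, once, before training starts."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, noise_std) for _ in range(n_samples)]


def batch_cost(weights, inputs, targets, noise_set):
    """Mean squared error of one noisy-tanh neuron over the frozen batch."""
    total = 0.0
    for x, target, noise in zip(inputs, targets, noise_set):
        net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
        total += (target - math.tanh(net + noise)) ** 2
    return total / len(inputs)
```

Because the same noise_set is reused every epoch, batch_cost is a deterministic function of the weights, so step-size and momentum adjustments can compare costs between epochs without being misled by fresh noise draws.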
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2.3.4  Optimization  of  discrete-valued  parameters 

Another area where this method has potential is for optimization problems that have discrete-valued
parameters. An example is a design optimization problem where the task is to select the right DC motor, pipe
diameter, or gear ratio from a finite set of discrete-valued options. It is expected that this method will
extend well to this family of problems [8].

3  Application  to  Training  Multi-Layer  Signum  Networks 

In  this  section,  this  method  is  shown  to  extend  to  multiple  layers  of  hard-limiting  units  with  no  modification. 
Figure  3  summarizes  the  method;  during  training,  replace  each  hard-limiter  with  a  sigmoid  and  zero-mean 
independent  noise  source.  Note  that  the  sharpness  of  the  sigmoids  does  not  matter  at  all  here,  since  the 
sharpness  factor  simply  multiplies  the  weights,  and  the  weights  are  adapted. 
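
The remark about sharpness can be written out for a single unit. Assuming the sigmoid has the form $\sigma(k \cdot \mathrm{net})$ with a scalar sharpness factor $k > 0$ (notation introduced here for illustration, not taken from the paper):

```latex
\sigma\bigl(k\, W^{T} X\bigr) = \sigma\bigl((kW)^{T} X\bigr)
```

so a unit with sharpness $k$ and weights $W$ computes the same function as a unit with sharpness 1 and weights $kW$, and adapting $W$ can absorb any fixed choice of $k$.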


[Figure 3 panels: Run-time; Training]

Figure  3:  A  multi-layer  signum  network,  seen  at  run-time  and  during  training 

In the first test, an adaptive 3-5-4 signum network is trained to emulate the input-output mapping
defined by an independent, fixed, 3-10-4 sigmoidal network. Fewer hidden neurons are used in the adaptive
network to ensure that overfitting will not introduce unnecessary complications. The 3-10-4 network’s
fixed weights were randomly chosen between -2 and 2.


[Figure 4 panels: Network Performance (horizontal axis: Noise Level); Average Weight Magnitude; activation distributions]


Figure 4: Direct training of a multi-layer signum network, NN-generated training set
Left: with higher noise levels, performance on the noisy sigmoidal network approaches that of the
signum network, indicating that the noisy sigmoid is a valid (and differentiable!) approximation
for the signum. Right: As noise increases, the network adapts to sharpen its sigmoids, causing
the first layer weights to increase, and the sigmoid output distributions to approach hard-limiters.
Activation distributions were collected over the whole training set, with no noise added.

Performance  is  shown  in  Figure  4.  Each  dot  on  the  graph  represents  the  final  performance  after  a  full 
training run (10,000 epochs or until a local minimum was reached). Seven values for noise level were chosen,
and ten different network initial conditions were used at each noise value. With no noise, performance is
good for the sigmoidal network, but when the signums are reintroduced at run-time, the error increases
dramatically. One point is off the graph at an error of over 6 units. As noise increases, performance on
the sigmoid network decreases, as expected, but the signum-network-performance improves, and approaches
the  sigmoid-network-performance.  The  weight  magnitude  and  neuron  activation  distribution  plots  confirm 
that  as  noise  increases,  the  noisy  sigmoids  behave  like  hard-limiters.  Note  that  these  activation  distributions 
could  not  have  been  achieved  by  manually  increasing  the  sharpness  of  the  sigmoids  -  this  would  have  had  zero 
net effect since the network would adapt the first layer weights to exactly counteract the sharpness increase.

Figure 5: Direct training of a network of hard-limiters to emulate optimal thruster mapping

In  the  second  application,  the  hard-limiting  network  is  trained  to  emulate  the  optimal  thruster  mapping, 
which  will  be  described  in  detail  in  the  next  section.  For  now,  this  mapping  is  used  as  an  independent  second 
test  of  the  method.  A  similar  dramatic  improvement  in  hard-limiting  performance  occurs  as  noise  increases 
past  about  0.5.  It  is  not  shown  on  the  plot,  but  good  performance  is  obtained  at  least  up  to  a  noise  level 
of  three.  The  training  set  for  this  mapping  represents  continuous  values  being  mapped  to  discrete  values,  so 
the  first-layer  weights  are  high  (indicating  sharp  decision  hyper-surfaces),  even  for  noise  =  0. 


4  Application  to  Robot  Thruster  Control 

4.1  Robot  Description 

Experiments  are  performed  using  a  mobile  robot,  shown  in  Figure  6,  that  operates  in  a  horizontal  plane,  using 
air-cushion technology to simulate the drag-free and zero-g characteristics of space [9]. The three degrees of
freedom (x, y, θ) of the base are controlled using eight thrusters positioned around its perimeter, shown in
the  upper  right  of  Figure  6.  The  on-off  thrusters  substantially  complicate  the  control  design,  due  to  their 
discontinuous  nature  and  the  fact  that  each  thruster  simultaneously  produces  both  a  net  force  and  torque. 

The overall objective is to use a set of eight full-on, full-off air jet thrusters to approximate a
continuous-valued force vector that is commanded by the position/attitude control law of the mobile robot. The neural
network  determines  the  combination  of  thrusters  to  fire  that  will  generate  a  (normalized)  resultant  force 
vector  as  close  as  possible  to  that  commanded. 

4.2  Indirect  training,  Application  of  noisy  sigmoids 

Three  different  techniques  used  to  solve  this  thruster  mapping  problem  are  summarized  in  the  lower  right 
of  Figure  6.  The  first  implementation  used  an  exhaustive  search  at  each  sample  period  to  find  the  thruster 
pattern  minimizing  the  force  error  vector  [9].  Symmetries  are  used  to  reduce  the  search  space,  but  this 
method  relies  on  testing  every  possible  thruster  pattern  to  find  the  one  with  minimum  error.  The  second 
method  used  a  neural  network  that  had  been  trained  directly  to  emulate  the  optimal  mapping  produced  by 
the exhaustive search, described in Section 3 and in [10] [11].

[Figure 6 panel labels: SEARCH (optimal solution); DIRECT TRAINING; INDIRECT TRAINING; Optimal mapping]

At every sample period,
given: Fx_des, Fy_des, Tθ_des
find: T1, T2, ..., T8
resulting in: Fx_act, Fy_act, Tθ_act
to minimize: |weighting · [Fdes - Fact]|

Fdes: desired force, [Fx_des, Fy_des, Tθ_des]
Fact: actual force, [Fx_act, Fy_act, Tθ_act]
T: thruster values, [T1, T2, ..., T8]
Topt: T that minimises the cost function
error: signal used to train network

Figure  6:  Free-flying  robot,  Thruster  mapping  problem  definition,  Solution  methods 

Left: The robot is a fully self-contained planar laboratory-prototype of a free-flying space robot
complete with on-board gas, thrusters, electrical power, multi-processor computer system, camera,
wireless Ethernet data/communications link, and two cooperating manipulators. It exhibits nearly
frictionless motion as it floats above a granite surface plate on a 50 micron thick cushion of air.
Right:  [Desired  forces  =>  Thruster  values]  mapping:  problem  definition,  solution  methods.  The 
on-off  thrusters  and  coupling  between  forces  and  torque  make  this  problem  difficult. 
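
The exhaustive-search method described above can be sketched as follows. The thruster table is entirely hypothetical (eight unit thrusters placed at the corners of a square base, invented for this example) and is not the geometry of the actual robot; the names THRUSTERS, resultant, and exhaustive_search are likewise introduced here. The sketch also omits the symmetry reductions and force normalization mentioned in the text.

```python
from itertools import product

# Hypothetical thruster table: (force_x, force_y, torque) produced by each of
# eight unit thrusters on a square base. Not the geometry of the actual robot.
THRUSTERS = [
    (-1.0, 0.0, 1.0), (0.0, -1.0, -1.0),
    (1.0, 0.0, -1.0), (0.0, -1.0, 1.0),
    (1.0, 0.0, 1.0), (0.0, 1.0, -1.0),
    (-1.0, 0.0, -1.0), (0.0, 1.0, 1.0),
]


def resultant(pattern):
    """Sum the force/torque contributions of the thrusters that are on."""
    return tuple(
        sum(on * thruster[axis] for on, thruster in zip(pattern, THRUSTERS))
        for axis in range(3)
    )


def exhaustive_search(desired, weighting=(1.0, 1.0, 1.0)):
    """Return the on-off pattern minimizing the weighted force error norm."""
    best_pattern, best_cost = None, float("inf")
    for pattern in product((0, 1), repeat=len(THRUSTERS)):   # 2**8 = 256 patterns
        actual = resultant(pattern)
        cost = sum((w * (d - a)) ** 2
                   for w, d, a in zip(weighting, desired, actual)) ** 0.5
        if cost < best_cost:
            best_pattern, best_cost = pattern, cost
    return best_pattern, best_cost
```

Even for eight thrusters the search visits every pattern at every sample period, which is the cost that the trained network mappings are meant to avoid.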


The  third  method,  indirect  training,  is  presented  here  and  shown  in  Figure  7.  In  this  case,  the  optimal 
mapping  is  not  used,  and  the  NN  must  learn  the  mapping  through  experimentation  with  the  plant  model. 
This requires back-propagation of error through the discontinuous thrusters, which motivated development
of  the  noise  injection  method  presented  in  this  paper. 


Figure  7:  Thruster  mapping,  indirect  training  method 


Figure  8  shows  the  result  of  training  with  two  differentiable  thruster  models.  During  training  with  the 
continuous  thruster  models,  the  NN  produces  a  mapping  with  a  very  low  error,  but  when  the  signums  are 
replaced  at  run-time,  the  error  is  large.  This  is  because  the  network  learned  to  optimize  the  solution  using 
outputs  that  would  be  unavailable  at  run-time.  The  resulting  roundoff  error  is  unknown  to  the  NN  during 
training. 

[Figure 8 panels: Thruster Models (horizontal axis: input, commanded thrust); Performance on Actual Plant, Different Models (horizontal axis: run number)]

Figure 8: Results of indirect training, two differentiable thruster models
The sigmoid-based approximation (without noise) is better than the linear model, but has limited
performance. The results from direct training represent a lower limit for comparison. Mapping
error is average percent error above the optimal mapping (which results from an exhaustive search
of all possible thruster combinations). The shaded areas represent the mean ± σ for ten different
runs. 3-10-4 layered networks were used.

Figure 9 shows the results when the thrusters are modelled by noisy tri-level sigmoids. With noise = 0,
error is high, corresponding to the data in Figure 8, but as noise increases, performance approaches that of the
network trained directly (emulating the optimal mapping). The direct-performance represents a lower bound
set by the functional complexity of the 3-10-4 layered network. The best noise value in this application
seems to be around 0.15, and the resulting noisy sigmoid is shown in the left side of Figure 9. Examining
this figure, the sigmoid sharpness and noise levels seem to be set correctly according to intuition. As noise
increases beyond 0.2, error increases (as the “off” region of the sigmoid becomes blurred) as expected, but
the method is fairly robust to the noise value selected.


[Figure 9 panels: Noisy Sigmoid Thruster Model (horizontal axis: input, commanded thrust); Performance on Actual Plant vs. Noise Level (horizontal axis: noise level)]

Figure 9: Results of indirect training, noisy tri-level sigmoid thruster model
Left: the sharpness (4) and noise level (0.15) for the noisy tri-level sigmoid appear to be intuitively
correct. Right: as noise increases, performance approaches that of the network trained directly
(emulating the optimal mapping). 3-10-4 layered networks were used.

A good solution results when noise is added because it prevents the network from using a solution that
uses non-saturated portions of the tri-level sigmoid. Such a solution would give a nearly random output and
high error during training. The training algorithm must find a solution that works well despite the noise
addition. This means the expected value of the output must be well into the saturated region to consistently
work well. The results approximate the optimal solution very well, and work when the tri-level sigmoids are
replaced with tri-level signums.

5 Conclusions

This paper has described a new technique that allows backpropagation learning to work with systems containing
discrete-valued functions, despite the discontinuity that exists between discrete values. The modification
to backpropagation is very small, simply requiring sigmoidal approximation of the discrete-valued functions,
and careful injection of noise into the smooth approximating function on the forward sweep. The noise
injection is critical to ensuring that the noisy sigmoid behaves like a signum during training.

Multi-layered  networks  of  hard-limiters  require  simpler  processing  hardware  than  do  multi-layered  sigmoid 
networks.  Sigmoid  networks  are  commonly  used  due  to  their  increased  functionality  as  well  as  the  lack  of  a 
reliable  training  algorithm  for  signum  networks.  Multi-layered  signum  networks  have  now  been  successfully 
trained using this noise injection method in two different applications, clearly demonstrating its usefulness
in  this  area. 

Application  to  a  complex  thruster  control  problem,  with  implementation  on  a  laboratory  model  of  a  free- 
flying  space  robot,  has  demonstrated  the  method’s  realizability  and  usefulness  for  on-off  control  problems. 

In  each  application,  the  training  behavior  in  the  presence  of  noise  has  been  well-understood,  and  the 
algorithm  appears  to  be  relatively  robust  to  the  amplitude  of  the  injected  noise. 
Abstract

In classic backpropagation nets, as introduced by Rumelhart et al. [1], the weights are modified according to the method of steepest descent. The goal of this weight modification is to minimise the error in net-outputs for a given training set.

Basing upon Jacobs' work [2], we point out drawbacks of steepest descent and suggest improvements on it. These yield a backpropagation net, which adjusts its weights according to a parallel coordinate descent method, whose parameters are being fuzzy-controlled.


1  Introduction 

The backpropagation net is a multilayer neural net consisting of one input-layer, one output-layer and at least one hidden-layer. The weights $w = (w_1, \ldots, w_n)$ of this net are modified by means of the backpropagation learning rule, which is supposed to perform steepest descent with the mean-squared-error function $F(w)$.

To accomplish this, Rumelhart et al. introduced the generalised delta rule:

$$w_{\text{new}} = w_{\text{old}} - \eta \nabla_w F(w) \qquad (1)$$

$\nabla_w F(w)$ is the gradient of $F$ with respect to $w$ and gives the direction of maximum increase relative to $w$. $\eta > 0$ is a constant, which is referred to as the learnrate. The portion of $-\nabla_w F(w)$ by which the weight vector $w$ is moved is determined by the value of this learnrate.

Image($F$) can be interpreted as a surface over the space of weight vectors. $F(w)$ gives the 'height' of this error surface at $w$. The goal of all weight adjustments is to find a $w^*$ for which $F$ takes on a global minimum $F(w^*) = F_{\min}$ (typically, $F_{\min} > 0$).

An advantageous property of the method of steepest descent is its global convergence, although it may converge only to a local minimum. Many modifications of the backpropagation algorithm that find a minimum of the error surface more quickly can only guarantee local convergence. The Quickprop algorithm is an example of this; it implements an approximation of Newton's method and its order of convergence is two.

The  parallel  coordinate  descent  we  propose,  which  provides  each  weight  with  its  own  learnrate,  comes 
without  this  disadvantage  and  it  seems  to  find  its  way  along  the  error  surfaces  faster  than  the  original 
backpropagation algorithm. By introducing a fuzzy control of some descent parameters we are able to
further  improve  the  performance  of  the  algorithm. 

We  will  now  point  out  the  drawbacks  of  steepest  descent  and  consider  the  improvements  suggested  by 
Jacobs  [2],  thereby  putting  forward  coordinate  descent  and  the  delta-bar-delta  rule.  Subsequently  a  hybrid 
of the delta-bar-delta rule and the momentum version is introduced and a fuzzy control of this hybrid is explained.
In  the  end  we  test  our  new  algorithm  and  compare  its  performance  with  that  of  the  generalised  delta  rule 
and  delta-bar-delta  rule. 

*e-mail: lippe@math.uni-muenster.de, feuring@math.uni-muenster.de, tenhagen@math.uni-muenster.de
2  Problems  with  the  Method  of  Steepest  Descent

The method of steepest descent, which is implemented by the classic backpropagation learning law, has some weak points that take effect on typically shaped error surfaces.

Hecht-Nielsen  [3]  makes  the  following  statements  about  the  shape  of  backpropagation  error  surfaces: 

•  Many  error  surfaces  have  extensive  flat  areas  and  troughs  with  little  slope. 

•  The symmetry of the net weights (each weight vector is equivalent to certain permutations of itself) causes many global minima to exist. As a result, most error surfaces appear rough in many dimensions.

•  Error surfaces with real local (not global) minima at a 'high' error level do exist. Little is known about the position and number of these minima.

Jacobs  [2]  closely  examined  the  behaviour  of  steepest  descent  on  backpropagation  error  surfaces  and  reached 
these  conclusions: 

•  If the error surface is flat in the dimension of one of the weights, the corresponding derivative is (absolutely speaking) small. Because the gradient component is that small, changing the corresponding weight by a portion of it (determined by the learnrate $\eta$) yields only a slight adjustment of the weight. If the error surface is flat in all dimensions, application of the learning law has almost no effect. Consequently the progress of the method is very slow in situations like these; it may even stop because of computing inaccuracies.

•  If  —  on  the  other  hand  —  the  curvature  of  the  error  surface  is  high  for  some  weight-dimensions,  a 
related  problem  with  (absolutely  speaking)  large  gradient-components  arises.  The  weight  vector  may 
be  moved  too  far  -  thus  overshooting  the  minimum. 

•  The  gradient  gives  the  direction  of  the  steepest  ascent.  But  the  negative  gradient  does  not  necessarily 
show  the  shortest  way  to  a  minimum.  So  steepest  descent  may  make  a  detour  on  its  way  to  a  minimum 
and  thus  may  encounter  more  difficulties. 
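
The first two of Jacobs' observations can be reproduced on a toy error surface. The sketch below is only an illustration: the quadratic surface $F(w) = \frac{1}{2}(0.01\,w_1^2 + 10\,w_2^2)$, the two learnrates and all function names are assumptions chosen to make the effect visible, not values taken from the paper.

```python
import numpy as np

CURVATURES = np.array([0.01, 10.0])   # flat direction, steep direction (illustrative)

def gradient(w):
    """Gradient of F(w) = 0.5 * sum(CURVATURES * w**2)."""
    return CURVATURES * w

def steepest_descent(w0, learnrate, n_steps):
    """Apply w <- w - learnrate * grad F(w) n_steps times and return the final w."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - learnrate * gradient(w)
    return w

start = [1.0, 1.0]
# Small learnrate: the steep direction converges, but the flat direction barely moves.
w_small = steepest_descent(start, learnrate=0.05, n_steps=100)
# Large learnrate: the steep direction overshoots the minimum on every step and diverges.
w_large = steepest_descent(start, learnrate=0.25, n_steps=100)
```

With a single learnrate there is no setting that suits both directions of this surface, which is the motivation for the individual learnrates discussed next.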

3  Improving  the  Backpropagation  Learning  Algorithm 

Jacobs  [2]  discusses  some  improvements  to  the  generalised  delta  rule,  upon  which  we  enlarge: 

3.1  Parallel  Coordinate  Descent 

The learnrate $\eta$ determines decisively by what amount each weight is adjusted. Because one learnrate for all weights cannot allow for the different curvature of the error surface in each dimension, each weight should be equipped with an individual learnrate.

By  using  individual  learnrates,  the  learning  law  no  longer  moves  a  point  on  the  error  surface  in  the 
direction  of  the  negative  gradient  and  therefore  no  longer  performs  steepest  descent. 

Actually, a kind of coordinate descent is now executed. This does not minimise $F(w)$ directly, but searches for $\min_{w_i} F(w)$ for each component $w_i$ of $w$. Unlike 'normal' coordinate descent methods, which change the weights one at a time (for instance the Gauss-Southwell method [4]), this method adjusts all components of $w$ in parallel.

The learning law implementing this parallel coordinate descent looks like this (it is derived by slightly changing (1) to allow for individual learnrates):

$$w_{\text{new}} = w_{\text{old}} - \sum_{i=1}^{n} \eta_i E_{i,i} \nabla_w F(w), \qquad (2)$$

where $n$ is the dimension of $w$; $\eta_i > 0$ is the learnrate corresponding to $w_i$; $E_{i,i}$ is an $n \times n$ matrix with every component equal to 0, except the component in row $i$ and column $i$, which is equal to 1 ($1 \le i \le n$). The following now holds:

Theorem 3.1 If the preliminaries for the convergence of the classic backpropagation algorithm are given, the parallel coordinate descent method will also converge to a minimum of the error surface.

This can be proved analogously to the corresponding proof for the backpropagation algorithm (if all $\eta_i$ are equal, the method behaves like steepest descent).
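
As a sketch of rule (2), the following code builds the selection matrices $E_{i,i}$ explicitly and checks the remark that equal learnrates reduce the method to steepest descent. The function names and the numerical values are illustrative assumptions; in practice the sum over $E_{i,i}$ is simply an element-wise product of learnrates and gradient.

```python
import numpy as np

def parallel_coordinate_step(w, grad, learnrates):
    """One step of rule (2): w_new = w_old - sum_i eta_i * E_ii * grad F(w)."""
    n = len(w)
    update = np.zeros(n)
    for i in range(n):
        E_ii = np.zeros((n, n))
        E_ii[i, i] = 1.0          # selects the i-th gradient component
        update += learnrates[i] * (E_ii @ grad)
    return w - update

def steepest_descent_step(w, grad, learnrate):
    """One step of rule (1): w_new = w_old - eta * grad F(w)."""
    return w - learnrate * grad

w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.2, -0.4, 3.0])   # stands in for grad F(w) at the current w
individual = parallel_coordinate_step(w, grad, np.array([0.5, 0.1, 0.01]))
equal_rates = parallel_coordinate_step(w, grad, np.full(3, 0.1))
```

With individual learnrates the step is no longer parallel to the negative gradient, because each component is scaled separately.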
3.2  The  Delta-Bar-Delta  Rule


Because the error surface curvature (in each dimension) is not the same everywhere, every learnrate $\eta_i$ should be allowed to vary. This variation can be controlled by the following heuristics:

When the sign of the derivative with respect to a weight is the same on several consecutive steps, the corresponding learnrate should be increased (we suppose the error surface to be flat in this situation). When the sign alternates on consecutive steps, the learnrate should be decreased.

The delta-bar-delta rule was developed by Jacobs [2]. It is a variation of the generalised delta rule and implements the heuristics mentioned above. In fact it consists of two rules: one for weight adjustment and the other for control of the learnrates.

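The weights are modified analogously to (2):

$$w(t+1) = w(t) - \sum_{i=1}^{n} \eta_i(t)\, E_{i,i}\, \nabla_w F(w(t)), \qquad (3)$$

where $w(t)$ is the weight vector's value at time step $t$ and $\eta_i(t)$ is the value of $\eta_i$ at time step $t$.

Every learnrate is adjusted according to $\eta(t+1) = \eta(t) + \Delta\eta(t)$ (for convenience we write $\eta$ instead of $\eta_i$) with:

$$\Delta\eta(t) = \begin{cases} \kappa, & \text{if } \bar\delta(t-1)\,\delta(t) > 0 \\ -\phi\,\eta(t), & \text{if } \bar\delta(t-1)\,\delta(t) < 0 \\ 0, & \text{else} \end{cases} \qquad (4)$$

where

$$\delta(t) = \frac{\partial F(t)}{\partial w(t)} \quad \text{and} \quad \bar\delta(t) = (1-\theta)\,\delta(t) + \theta\,\bar\delta(t-1) = (1-\theta) \sum_{i=0}^{t} \theta^{i}\,\delta(t-i).$$

$\kappa > 0$ and $\phi \in [0, 1]$ are constants, $\theta \in (0, 1]$; $\kappa$, $\phi$ and $\theta$ are the same for every learnrate.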


When the sign of the current (step $t$) derivative and that of the exponential average at step $(t-1)$ are the same (the error surface is presumably flat), the learnrate is increased by $\kappa$. When the signs are different (the curvature of the error surface is presumably high), the learnrate is decreased by a portion (determined by $\phi$) of itself.

The $\eta_i$ are decreased exponentially by the delta-bar-delta rule, thereby guaranteeing fast decrease and $\eta_i > 0$.

The increase of the learnrates is done linearly to prevent them from increasing too quickly.

The effectiveness of the net depends decisively on $\kappa$: if it is set to an inadequately small value, the learnrates will increase too slowly. If $\kappa$ is set too large, the algorithm will become very inaccurate. Taking into consideration the existence of extensive flat areas on error surfaces (see Section 2), we see the importance of a good choice for $\kappa$ (in flat areas the first case of (4) takes effect). A 'good choice' for $\kappa$ can only be made after a couple of experiments.

Ideally, $\kappa$ should be set to a different (appropriate) value for each weight and each step. Therefore we introduce a fuzzy control of $\kappa$ in (4).
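
The learnrate control (4) can be sketched in a few lines. The code below is a minimal illustration, assuming that all learnrates are held in one array and updated together; the function name and the example values are hypothetical.

```python
import numpy as np

def delta_bar_delta_update(learnrate, delta, delta_bar_prev, kappa, phi, theta):
    """Apply rule (4) to every learnrate.

    learnrate, delta and delta_bar_prev are arrays with one entry per weight.
    Returns the new learnrates and the new exponential averages.
    """
    agreement = delta_bar_prev * delta
    change = np.where(agreement > 0, kappa,
                      np.where(agreement < 0, -phi * learnrate, 0.0))
    delta_bar = (1.0 - theta) * delta + theta * delta_bar_prev
    return learnrate + change, delta_bar

eta = np.array([0.1, 0.1, 0.1])
delta_bar_prev = np.array([0.5, 0.5, 0.0])
delta = np.array([0.3, -0.3, 0.3])      # same sign, opposite sign, no history
eta_new, delta_bar_new = delta_bar_delta_update(
    eta, delta, delta_bar_prev, kappa=0.001, phi=0.5, theta=0.5)
```

The first learnrate grows by the constant $\kappa$, the second is halved because $\phi = 0.5$, and the third is left unchanged, which shows the linear increase and the exponential decrease side by side.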


3.3  The  Momentum  Version 

This  modification  of  the  generalised  delta  rule  leaves  the  (single)  learnrate  unchanged. 

At step $t$ each weight $w(t)$ (no indices for convenience) is adjusted according to:

$$w(t+1) = w(t) + \Delta w(t) \qquad (5)$$

$$\Delta w(t) = -(1-\alpha)\,\eta\,\frac{\partial F}{\partial w(t)} + \alpha\,\Delta w(t-1) = -(1-\alpha)\,\eta \sum_{i=0}^{t} \alpha^{i}\,\frac{\partial F}{\partial w(t-i)} \qquad (6)$$

where $\alpha \in [0, 1]$ (referred to as the momentum term) and $\eta$ is the learnrate. $\Delta w(t-1)$ gives the amount by which the weight $w$ was changed during the previous step. Typically, $\alpha$ is set to about 0.9. This is an arbitrary choice and may have to be revised after a couple of experiments.




When  the  derivatives  have  the  same  sign  on  consecutive  steps,  the  sum  in  (6)  grows  larger  (causing  the 
change  in  w  to  be  greater),  otherwise  it  stays  small  (causing  smaller  change  in  w). 

The  momentum  version  has  chiefly  two  weak  points: 

There may exist an upper bound on the sum in (6) (if all derivatives are = 0, for instance); thereby the greatest amount by which a weight can be changed is limited, which is not desirable in certain flat areas of the error surface. In addition, the sign of the sum starting at $i = 1$ may differ from the sign of the current derivative; thus, in an extreme situation, $w$ may be moved in the wrong direction.
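
Both weak points can be observed numerically with the recursive form of (6). The sketch below is illustrative only: the derivative sequences, the learnrate and the function name are assumptions.

```python
def momentum_steps(derivatives, learnrate, alpha):
    """Return the weight changes given by rule (6) for a sequence of derivatives."""
    changes = []
    previous = 0.0
    for derivative in derivatives:
        change = -(1.0 - alpha) * learnrate * derivative + alpha * previous
        changes.append(change)
        previous = change
    return changes

# Constant derivative, as on a flat slope: the step grows towards
# learnrate * derivative in magnitude but cannot exceed it.
flat = momentum_steps([0.1] * 200, learnrate=0.5, alpha=0.9)
# A sign flip after a long run: the accumulated sum still pushes w the old way.
flipped = momentum_steps([0.1] * 50 + [-0.1], learnrate=0.5, alpha=0.9)
```

In the first run the step size saturates at 0.05 however long the slope continues. In the second run the last derivative is negative, so descent calls for a positive change, yet the computed change is still negative.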


3.4  The  Hybrid  Rule 

The hybrid rule uses individual, variable learnrates and a momentum term. The learnrates are adjusted according to the learnrate updating rule (4) of the delta-bar-delta rule. The momentum version (5) (with individual learnrates) is used as weight modification rule.

Without further changes the two methods do not cooperate ideally (which is what Jacobs observed in his comparison of the pure delta-bar-delta rule with the hybrid rule).

On the one hand, a large momentum term (i.e. a small $(1-\alpha)$) makes the learnrate less important in determining the weight change, so the benefits of the complicated delta-bar-delta rule have only little effect. On the other hand, the momentum version is more effective on flat areas of the error surface if $\alpha$ is large, whereas $\alpha$ should be small on areas with high curvature.
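
One step of the hybrid rule can be sketched by combining the momentum weight change with the learnrate control (4). The code is a minimal illustration under stated assumptions: the weight change at step $t$ uses the learnrates $\eta_i(t)$ before they are adjusted, as in (3), and the function name, argument layout and example values are hypothetical.

```python
import numpy as np

def hybrid_step(w, grad, eta, delta_bar, prev_change, kappa, phi, theta, alpha):
    """One hybrid step: weights follow (5) and (6), learnrates follow (4)."""
    # Momentum weight change with one learnrate per weight.
    change = -(1.0 - alpha) * eta * grad + alpha * prev_change
    # Learnrate control: compare the current derivative with the old average.
    agreement = delta_bar * grad
    eta_next = eta + np.where(agreement > 0, kappa,
                              np.where(agreement < 0, -phi * eta, 0.0))
    delta_bar_next = (1.0 - theta) * grad + theta * delta_bar
    return w + change, eta_next, delta_bar_next, change

w = np.array([1.0, 1.0])
grad = np.array([0.01, 10.0])            # flat direction, steep direction
eta = np.array([0.1, 0.1])
delta_bar = np.array([0.005, -2.0])      # second weight has just changed sign
w_next, eta_next, delta_bar_next, change = hybrid_step(
    w, grad, eta, delta_bar, np.zeros(2),
    kappa=0.01, phi=0.5, theta=0.5, alpha=0.9)
```

With $\alpha = 0.9$ the factor $(1-\alpha) = 0.1$ scales every learnrate down by a factor of ten in the weight change, which makes the reduced influence of the learnrates visible.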


4  The  Fuzzy  Controller 

From our observations in (3.2), (3.3) and (3.4) we draw the following conclusion: the hybrid rule (from (3.4)) can be improved strongly by allowing the parameters $\kappa$ (of the delta-bar-delta rule) and $\alpha$ (of the momentum version) to vary. To control these parameters we use a fuzzy controller. The heuristics for adjusting $\kappa$ and $\alpha$ are easy to implement this way. Also, fuzzy control yields flexible outputs and we do not have to think about exponential and/or linear decreases or increases. The heuristics we used in the construction of the controller (see Appendix B) are:

The longer the weight vector is in a flat area of the error surface, the larger $\kappa$ and $\alpha$ may be set. $\kappa$ should be allowed to become large enough that it can take effect despite a large $\alpha$. In areas of high curvature $\kappa$ and $\alpha$ should be small.

We employ a Sugeno-type fuzzy controller [5], which is based upon the 'familiar' IF...THEN rules. In contrast to these 'familiar' constructs, the Sugeno-type THEN instructions do not contain any fuzzy sets but crisp functions. The output of the controller is the average of these functions' values. As a result, no defuzzification has to be performed and the fuzzy controller works faster than the 'familiar' one.

For use in the IF clauses, several fuzzy sets are defined. These describe the current curvature of the error surface (for instance VeryLow, NotSure, High). In a new variable we record how often each case in (4) is selected. This new variable is then used to represent the curvature of the error surface.

4.1  Computational  Expense 

The  new  learning  law  is  more  complicated  to  compute  than  the  generalised  delta  rule.  A  couple  of  extra 
additions, multiplications and comparisons have to be executed. These additional computations take less
time  than  performing  one  complete  training  step.  The  additional  expense  is  justifiable  since  the  faster 
convergence  of  our  new  method  decreases  the  number  of  training  steps  which  have  to  be  computed. 


5  Conclusions 

We modified the classic backpropagation algorithm to perform a parallel coordinate descent, based upon Jacobs' ideas. Heuristics about the properties of the error surface have further been implemented in a fuzzy control of the parameters $\kappa$ and $\alpha$. The modified net needs far fewer training steps to 'learn' a training set, without too many extra computations. In addition we still have a method with global convergence.
Appendix  A:  Two  Tests

Our new learning law (referred to as fuzzy hybrid) is tested with two different problems. We compare the results of fuzzy hybrid with the performance of the generalised delta rule and the delta-bar-delta rule. The parameters of each rule were set to values that lead to fastest convergence.

1.  The  net  is  trained  to  learn  the  XOR  function.  The  training  was  repeated  40  times,  starting  with  a 
different  weight  vector  each  time.  The  diagram  shows  the  average  of  the  results. 

2.  The training set consisted of 28 examples of f(z) = z^, where z ∈ [1, 10]. The diagram shows the
(typical)  results  of  training  the  net  once. 

Note: $\eta_0$ and $\kappa_0$ are starting values that have to be set by the user.

The y-axis of each diagram is scaled logarithmically.


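[Diagrams: XOR problem and function problem; horizontal axis: number of training steps.]

Learning law parameters for the XOR problem:

generalised delta rule: $\eta = 0.8$
delta-bar-delta rule: $\eta_0 = 0$, $\kappa = 0.001$, $\phi = 0.5$, $\theta = 0.5$
fuzzy hybrid: $\eta_0 = 0$, $\kappa_0 = 0.001$, $\phi = 0.1$, $\theta = 0.5$

Learning law parameters for the function problem:

generalised delta rule: $\eta = 0.001$
delta-bar-delta rule: $\eta_0 = 0$, $\kappa = 0.0005$, $\phi = 0.8$, $\theta = 0.5$
fuzzy hybrid: $\eta_0 = 0$, $\kappa_0 = 0.0005$, $\phi = 0.8$, $\theta = 0.5$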
Appendix  B:  Details  of  the  Fuzzy  Controller

The value of the variable c[i] describes the curvature of the error surface in the i-th dimension (the corresponding weight is $w_i$):


.rn  —  /  ^ 

-X  c[i]-5  .ifi^ 


was  increased  by  k  , 
was  decreased  by  —  ^  i)i(t) 


$\kappa$ and $\phi$ are parameters of the delta-bar-delta rule and $\eta_i$ is the learnrate corresponding to $w_i$.
Additionally, we make certain that c[i] ∈ [−1, 100].

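$\kappa$ and $\alpha$ are controlled by the following rules:

IF (c[i] is (V)eryLow) THEN $\kappa := V(c[i]) \cdot 100\,\kappa_0$, $\alpha := V(c[i]) \cdot 0.9$
IF (c[i] is (L)ow) THEN $\kappa := L(c[i]) \cdot 10\,\kappa_0$, $\alpha := L(c[i]) \cdot 0.7$
IF (c[i] is (N)otSure) THEN $\kappa := N(c[i]) \cdot \kappa_0$, $\alpha := N(c[i]) \cdot 0.3$
IF (c[i] is (H)igh) THEN $\kappa := H(c[i]) \cdot \kappa_0 / 10$, $\alpha := H(c[i]) \cdot 0.01$

$\kappa_0$ is a starting value that has to be set by the user.

($\kappa$ and $\alpha$ are computed anew for each weight at each step, so their values do not have to be stored.)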
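
The rules above can be sketched as a small Sugeno-type controller. The constants for $\kappa$ and $\alpha$ are taken from the rules; everything else is an assumption made for illustration. In particular, the piecewise-linear membership functions below are hypothetical (they are not the fuzzy sets of the paper), the output is computed as the membership-weighted average of the rule constants, and large values of c[i] are assumed to indicate a flat surface, i.e. very low curvature.

```python
import numpy as np

# Hypothetical piecewise-linear membership functions over c in [-1, 100].
MEMBERSHIP_POINTS = {
    "VeryLow": ([10.0, 50.0, 100.0], [0.0, 1.0, 1.0]),
    "Low":     ([0.0, 10.0, 50.0],   [0.0, 1.0, 0.0]),
    "NotSure": ([-1.0, 0.0, 10.0],   [0.0, 1.0, 0.0]),
    "High":    ([-1.0, 0.0],         [1.0, 0.0]),
}
# Rule constants from the rules above: (factor applied to kappa_0, value for alpha).
CONSEQUENTS = {
    "VeryLow": (100.0, 0.9),
    "Low":     (10.0, 0.7),
    "NotSure": (1.0, 0.3),
    "High":    (0.1, 0.01),
}

def memberships(c):
    """Degree of membership of c in each (hypothetical) fuzzy set."""
    c = float(np.clip(c, -1.0, 100.0))
    return {name: float(np.interp(c, xs, ys))
            for name, (xs, ys) in MEMBERSHIP_POINTS.items()}

def fuzzy_control(c, kappa_0):
    """Return (kappa, alpha) as membership-weighted averages of the rule constants."""
    mu = memberships(c)
    total = sum(mu.values())
    kappa = sum(mu[n] * CONSEQUENTS[n][0] * kappa_0 for n in mu) / total
    alpha = sum(mu[n] * CONSEQUENTS[n][1] for n in mu) / total
    return kappa, alpha

kappa_flat, alpha_flat = fuzzy_control(100.0, kappa_0=0.001)
kappa_curved, alpha_curved = fuzzy_control(-1.0, kappa_0=0.001)
kappa_mid, alpha_mid = fuzzy_control(5.0, kappa_0=0.001)
```

Because no fuzzy set appears on the THEN side, the controller output is obtained directly from crisp values and no defuzzification step is needed.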


The  four  fuzzy  sets  are  defined  by: 
ABSTRACT

In many domains where the data belongs to a multidimensional vector space, there is a need to have vector representation and vector processing of signals. This paper gives a new tool to do this: the multi-layer vector neural network. In this architecture, the neuron inputs and outputs are vectors, the weights are matrices, and the activation functions are vector functions. The training technique presented in this paper is the vector back propagation algorithm, which is a generalization of the scalar back propagation algorithm.


1.  INTRODUCTION 

Real-valued neural networks (RNN) have been applied to many fields [14] such as signal processing, pattern recognition, vector quantization, function approximation... These neural network models have very interesting properties: parallel distributed processing [21], self organization [13], universal approximation [4, 9, 11, 17], best approximation [10], fast adaptive filtering [12] ... This allows RNN to have, in general, better performances than non-neural processing tools. The back propagation (BP) algorithm [16, 21, 22] is one of the most popular algorithms used for the learning process.

Recently, some authors have proposed complex-valued neural networks (CNN) and extended the BP algorithm to the complex plane [5, 8, 15]. These architectures can be applied to domains where signals have complex representation (e.g. signal processing and digital communications) [5, 6, 7].

In many domains, like array signal processing, the data belongs to a vector space and its treatment needs vector mappings. It is of great importance, therefore, to have neural networks that allow not only a vector representation of signals, but also a vector processing of them. Scalar-Valued Neural Networks (i.e. RNN and CNN) cannot allow both vector representation and vector processing because the activation functions are scalar functions.

In this paper we present the multi-layer vector neural network (MVNN). In this architecture, the neuron inputs and outputs are vectors, the weights are represented by matrices, and the activation functions are vector functions. The training algorithm is called the Vector Back Propagation (VBP) algorithm because it propagates error vector terms in a backwards fashion. The classic BP algorithm [16, 21, 22] corresponds to the VBP when all variables are scalars and the activation functions are defined in the real or the complex domain.

The VBP algorithm can be useful in many applications dealing with vector data like vector quantization, array signal processing, principal component analysis, digital communications over nonlinear channels (with complex-valued signals), speech coding, classification and pattern recognition, vector function approximation, learning under constraints, robotics, parallel distributed processing ...

This paper is restricted to MVNNs; an analogous analysis concerning vector self organization and associative maps (VSAMs) is discussed in [18]. VSAMs generalize Kohonen feature maps (KFM) and the adaptive resonance theory (ART) to multidimensional vector spaces.

Since vector spaces are linear manifolds, both MVNNs and VSAMs are special cases of the manifold neural network (MNN) presented in [18, 19].
The paper is organized as follows: in part 2 we describe the MVNN architecture. Part 3 gives the VBP learning rule. Finally, some concluding remarks are given in part 4.

2.  THE  MULTI-LAYER  VECTOR  NEURAL  NETWORK 

The letters R and C will denote the set of real and complex numbers, respectively. Let $\Phi$ be either R or C. $E_K$ denotes a $K$-dimensional vector space over $\Phi$. $B_K = (e_1, \ldots, e_K)$ denotes the canonic basis of $E_K$, which is supposed to be normed by means of the usual Euclidean metric. $L(E_{K_1}, E_{K_2})$ denotes the set of linear mappings from $E_{K_1}$ to $E_{K_2}$. $F(E_{K_1}, E_{K_2})$ denotes the set of vector functions defined from $E_{K_1}$ to $E_{K_2}$.

The multi-layer vector neural network consists of many vector units, or vector neurons. A vector neuron is composed of a vector linear combiner $\Sigma$ and a vector activation function $f_{ik}$:


Figure 1: The vector neuron


Note that the indices $i$ and $k$ are not important here. They refer to the layer index and the neuron index, respectively, see figure 2.

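The input is composed of $N(i-1)$ vectors $\{x_{i-1,j}\}$, $1 \le j \le N(i-1)$, of a vector space $E_{K(i-1)}$. The weights $\{W_{ikj}\}$, $1 \le j \le N(i-1)$, are $K(i) \times K(i-1)$ matrices (they correspond to matrix representations of $N(i-1)$ linear mappings of $L(E_{K(i-1)}, E_{K(i)})$). The bias $b_{ik}$ belongs to $E_{K(i)}$. The linear combiner sums the bias and the input vectors multiplied by the corresponding matrix weights:

$$net_{ik} = b_{ik} + \sum_{j=1}^{N(i-1)} W_{ikj}\, x_{i-1,j}$$

The activation function $f_{ik}$ maps the vector $net_{ik}$ into a vector $x_{ik}$:

$$f_{ik} : E_{K(i)} \to E_{K(i)}, \qquad x_{ik} = f_{ik}(net_{ik})$$

In the canonic basis $B_{K(i)}$, vector $x_{ik}$ can be written

$$x_{ik} = \left[ f_{ik}^{1}(net_{ik}), \ldots, f_{ik}^{K(i)}(net_{ik}) \right]^T$$

Note that the function $f_{ik}^{p}$ (corresponding to the direction $e_p$) is a function of the vector $net_{ik}$ (and not of the scalar $net_{ik}^{p}$). This is the main difference with SNNs.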
$f_{ik}$ can be any vector function (differentiable or not). For example, it can be the generalized characteristic function (in the case of vector spaces defined over R):

^ fbrall  l^/^/r(0 
=  [0..0f  ifne/4^0  for  all 

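Another example of $f_{ik}$ is the multidimensional quadratic sigmoidal function:

$$f_{ik}(net_{ik}) = \left[ \mathrm{th}(net_{ik}^T A_1\, net_{ik}), \ldots, \mathrm{th}(net_{ik}^T A_{K(i)}\, net_{ik}) \right]^T$$

where $\{A_p\}$, $1 \le p \le K(i)$, are $K(i) \times K(i)$ matrices and th is the hyperbolic tangent function.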
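
A single vector neuron with the quadratic sigmoidal activation can be sketched as follows. This is only an illustration: the function names, the array layout (the matrices $A_p$ stacked in one array) and the random sizes and values are assumptions.

```python
import numpy as np

def linear_combiner(bias, weights, inputs):
    """net = bias + sum_j W_j x_j, with each W_j of shape (K_out, K_in)."""
    net = bias.copy()
    for W, x in zip(weights, inputs):
        net = net + W @ x
    return net

def quadratic_sigmoid(net, A):
    """p-th output component is tanh(net^T A_p net); A has shape (K, K, K)."""
    return np.tanh(np.einsum("a,pab,b->p", net, A, net))

def vector_neuron(bias, weights, inputs, A):
    """Output vector of one vector neuron."""
    return quadratic_sigmoid(linear_combiner(bias, weights, inputs), A)

rng = np.random.default_rng(0)
K_in, K_out, N_in = 3, 2, 4            # illustrative sizes
inputs = [rng.standard_normal(K_in) for _ in range(N_in)]
weights = [rng.standard_normal((K_out, K_in)) for _ in range(N_in)]
bias = rng.standard_normal(K_out)
A = rng.standard_normal((K_out, K_out, K_out))
output = vector_neuron(bias, weights, inputs, A)
```

Each output component depends on the whole vector $net_{ik}$ through the quadratic form, not only on the matching scalar component, which is the property that distinguishes the vector neuron from a scalar one.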
The  multi-layer  vector  neural  network  is  shown  in  figure  2. 


Figure 2: Multi-layer Vector Neural Network (input layer 0, layer i, output layer L).

In figures 1 and 2, $i$ denotes the layer index, $x_{ik}$ the output of neuron $k$ of layer $i$, $W_{ikj}$ the matrix weight that links vector $x_{i-1,j}$ to neuron $k$ of layer $i$, $N(i)$ the number of neurons in layer $i$, and $E_{K(i)}$ the vector space to which the layer $i$ outputs belong.

The network works as follows: $N(0)$ vectors (of a vector space $E_{K(0)}$) are fed to the network input and are processed in a feed forward fashion through intermediate vector spaces: each layer $i$ has at its inputs $N(i-1)$ vectors (of the vector space $E_{K(i-1)}$) and maps them into $N(i)$ vectors (of the vector space $E_{K(i)}$).

If we denote by $\Psi_i$ this mapping, then we have

$$\Psi_i : (E_{K(i-1)})^{N(i-1)} \to (E_{K(i)})^{N(i)}, \qquad (x_{i-1,1}, \ldots, x_{i-1,N(i-1)}) \mapsto (x_{i1}, \ldots, x_{iN(i)})$$

The global network input-output mapping represents a function $\Psi$ which belongs to the set of vector functions

$$\Psi : (E_{K(0)})^{N(0)} \to (E_{K(L)})^{N(L)}, \qquad (x_{01}, \ldots, x_{0N(0)}) \mapsto (x_{L1}, \ldots, x_{LN(L)})$$

If all variables are scalars ($E_{K(i)} = \Phi$), the MVNN is reduced to an SNN. Therefore, the mapping $\Psi_{SNN}$ belongs to $F(\Phi^{N(0)}, \Phi^{N(L)})$.

Many algorithms can be used to train the VNN (in supervised or unsupervised learning procedures); in this paper we describe only the vector back propagation algorithm, which is a generalization of the scalar back propagation algorithm. Other algorithms used for SNN (see for instance [14, 16]) can also be generalized to VNN.

3.  THE  VECTOR  BACK  PROPAGATION  ALGORITHM  (VBP) 


In supervised learning, $N(0)$ input vectors $\{x_{01}, \ldots, x_{0N(0)}\}$ (of a vector space $E_{K(0)}$) and $N(L)$ desired output vectors $\{d_1, \ldots, d_{N(L)}\}$ (of a vector space $E_{K(L)}$) are presented to the neural network at time $t$. The goal is to adjust the network weights in order to minimize a cost function $\xi$.
We define the error vectors $e_j(t)$ at time $t$ as the difference between the desired output vectors $d_j(t)$ and the output vectors $x_{Lj}(t)$:

$$e_j(t) = d_j(t) - x_{Lj}(t), \quad 1 \le j \le N(L)$$

The cost function is defined as the sum of the squared norms of these error vectors:

$$\xi = \sum_{j=1}^{N(L)} \|e_j\|^2 = \sum_{j=1}^{N(L)} e_j^{*T} e_j = \sum_{j=1}^{N(L)} \sum_{p=1}^{K(L)} |e_j^p|^2$$

where $e_j^*$ denotes the conjugate of vector $e_j$, and $^T$ denotes the transpose.

Without loss of generality, we suppose that $\Phi = R$. A similar study can be done for the case $\Phi = C$ (one should calculate the derivatives according to the conjugate variables).

We suppose that the vector activation functions are all differentiable.

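3.1  Notation

We denote by $J(f_{ik})$ the differential matrix of $f_{ik}$ with respect to vector $net_{ik}$:

$$J(f_{ik}) = \left[ \frac{\partial f_{ik}^{p}}{\partial net_{ik}^{l}} \right], \quad 1 \le p \le K(i) \text{ (line index)}, \quad 1 \le l \le K(i) \text{ (column index)}$$

We denote by $J_{W_{ikj}}(\xi)$ the differential matrix of $\xi$ with respect to the matrix $W_{ikj}$:

$$J_{W_{ikj}}(\xi) = \left[ \frac{\partial \xi}{\partial W_{ikj}^{mn}} \right], \quad 1 \le m \le K(i) \text{ (line index)}, \quad 1 \le n \le K(i-1) \text{ (column index)}$$

We denote by $J_{b_{ik}}(\xi)$ the gradient of $\xi$ with respect to vector $b_{ik}$:

$$J_{b_{ik}}(\xi) = \left[ \frac{\partial \xi}{\partial b_{ik}^{p}} \right], \quad 1 \le p \le K(i) \text{ (line index)}$$

We denote by $\delta_{ik}$ the gradient of $\xi$ with respect to vector $x_{ik}$:

$$\delta_{ik} = \left[ \frac{\partial \xi}{\partial x_{ik}^{p}} \right], \quad 1 \le p \le K(i) \text{ (line index)}$$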


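3.2  The VBP Algorithm

The VBP algorithm minimizes the cost function $\xi$ by updating weights and biases according to the gradient search technique:

$$W_{ikj}(t+1) = W_{ikj}(t) - \mu\, J_{W_{ikj}}(\xi) \qquad (2)$$

$$b_{ik}(t+1) = b_{ik}(t) - \mu\, J_{b_{ik}}(\xi) \qquad (3)$$

$\mu$ is a small positive constant.

The term $J_{W_{ikj}}(\xi)$ can be expressed as a function of 3 terms:

$$J_{W_{ikj}}(\xi) = (J(f_{ik}))^T\, \delta_{ik}\, x_{i-1,j}^T \qquad (4)$$

The second term of the right hand side (RHS) of (4), called error term, depends on the cost function; the other two depend only on layer $i$ parameters (i.e. the activation function and the input vectors).

The term $J_{b_{ik}}(\xi)$ can be expressed as a function of two terms (5). The first depends only on layer $i$, the second is the error term:

$$J_{b_{ik}}(\xi) = (J(f_{ik}))^T\, \delta_{ik} \qquad (5)$$

The error terms are computed efficiently by starting with the output layer:

$$\delta_{Lk} = \frac{\partial \xi}{\partial x_{Lk}} = \frac{\partial}{\partial x_{Lk}} \sum_{j=1}^{N(L)} \sum_{p=1}^{K(L)} (e_j^p)^2 = -2\, e_k \qquad (6)$$

and working backwards through the other layers:

$$\delta_{ik} = \sum_{p=1}^{N(i+1)} (W_{i+1,p,k})^T\, (J(f_{i+1,p}))^T\, \delta_{i+1,p} \qquad (7)$$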

The Vector Back Propagation Algorithm:

Step 1. Initialize matrix weights and vector offsets:

Set all matrix weights to small values (according to a matrix norm), and all vector biases to small values (according to the Euclidean norm).

Step 2. Present inputs and desired outputs:

Present $N(0)$ input vectors $\{x_{01}, \ldots, x_{0N(0)}\}$ (of a vector space $E_{K(0)}$), and $N(L)$ desired output vectors $\{d_1, \ldots, d_{N(L)}\}$ (of a vector space $E_{K(L)}$).

Step 3. Calculate actual outputs:

Use the formula (1) to calculate the outputs $\{x_{L1}, \ldots, x_{LN(L)}\}$.

Step 4. Adapt weights:

$$W_{ikj}(t+1) = W_{ikj}(t) - \mu\,(J(f_{ik}))^T\, \delta_{ik}\, x_{i-1,j}^T$$

$$b_{ik}(t+1) = b_{ik}(t) - \mu\,(J(f_{ik}))^T\, \delta_{ik}$$

$\delta_{ik}$ is calculated as follows:

For the output layer: $\delta_{Lk} = -2\, e_k$ (a $K(L)$-dim. vector)

For layer $i$: $\delta_{ik} = \sum_{p=1}^{N(i+1)} (W_{i+1,p,k})^T\, (J(f_{i+1,p}))^T\, \delta_{i+1,p}$ (a $K(i)$-dim. vector)

Step 5. Repeat by going to step 2.
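
The five steps can be sketched compactly for the real-valued case. The code below is an illustration under stated assumptions, not the authors' implementation: every neuron uses a component-wise tanh as its vector activation function (so the differential matrix is diagonal), all neurons of a layer share one vector dimension, the weights of a layer are stored in a single four-dimensional array, and all names, sizes and values are hypothetical.

```python
import numpy as np

def init_network(sizes, rng, scale=0.1):
    """sizes[i] = (N(i), K(i)); returns one (W, b) pair for each layer 1..L."""
    layers = []
    for (n_prev, k_prev), (n, k) in zip(sizes[:-1], sizes[1:]):
        W = scale * rng.standard_normal((n, n_prev, k, k_prev))  # W[k, j] is a matrix
        b = scale * rng.standard_normal((n, k))
        layers.append((W, b))
    return layers

def forward(layers, x0):
    """Return the outputs of every layer, starting with the input layer."""
    outputs = [x0]
    for W, b in layers:
        net = b + np.einsum("kjab,jb->ka", W, outputs[-1])
        outputs.append(np.tanh(net))
    return outputs

def cost(layers, x0, d):
    """Sum of the squared norms of the output error vectors."""
    return float(np.sum((d - forward(layers, x0)[-1]) ** 2))

def gradients(layers, x0, d):
    """Back-propagate the error terms; returns (grad_W, grad_b) for each layer."""
    outputs = forward(layers, x0)
    delta = -2.0 * (d - outputs[-1])               # output-layer error terms
    grads = []
    for (W, b), x_prev, x in zip(reversed(layers),
                                 reversed(outputs[:-1]),
                                 reversed(outputs[1:])):
        g = (1.0 - x ** 2) * delta                 # J(f)^T delta for component-wise tanh
        grads.append((np.einsum("ka,jb->kjab", g, x_prev), g))
        delta = np.einsum("kjab,ka->jb", W, g)     # error terms of the previous layer
    return grads[::-1]

def vbp_step(layers, x0, d, mu):
    """One gradient step on all matrix weights and vector biases."""
    return [(W - mu * gW, b - mu * gb)
            for (W, b), (gW, gb) in zip(layers, gradients(layers, x0, d))]

rng = np.random.default_rng(0)
sizes = [(2, 3), (3, 2), (2, 2)]           # (number of neurons, vector dimension)
layers0 = init_network(sizes, rng)
x0 = rng.standard_normal((2, 3))           # N(0) = 2 input vectors, K(0) = 3
d = rng.uniform(-0.5, 0.5, size=(2, 2))    # N(L) = 2 desired vectors, K(L) = 2
initial_cost = cost(layers0, x0, d)
trained = layers0
for _ in range(200):
    trained = vbp_step(trained, x0, d, mu=0.05)
final_cost = cost(trained, x0, d)
```

A general vector activation function would replace the factor `(1.0 - x ** 2)` by a multiplication with the transposed differential matrix; the rest of the backward sweep stays the same.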


4.  CONCLUSION  AND  FUTURE  WORK: 

In this paper we have presented the Multi-layer Vector Neural Network (MVNN) whose neuron inputs and outputs are vectors, weights are matrices, and activation functions are vector functions. The Vector Back Propagation (VBP) algorithm, which is a generalization of the scalar one, was also presented. Other algorithms and models used in Scalar-valued Neural Networks (SNN) (see for instance [3, 3, 16, 21]) can also be extended to MVNN. In further communications we will try to explore the mathematical properties of VNN, like those obtained for SNN (see for instance [1, 10, 20]). Many questions have to be answered: Can the universal approximation [11] be generalized to MVNN, and which activation functions are able to do this? How to extend the learning curve models, studied for SNN (e.g. [2, 12, 13, 20]), to the multidimensional case? Note that the universal approximation of SNN was demonstrated after 1988!

The answers to these questions are not yet known. We think, however, that VNN can be an important tool to treat multidimensional problems (see the computational examples given in [18]).
Abstract

In the present paper, we propose a method of entropy maximisation to improve generalisation. For good generalisation, the strength or the number of weights must be significantly reduced and the input patterns must be represented over many hidden units. In other words, the cost (represented in the number or the strength of weights) must be as small as possible, while the diversity of hidden unit activity is as large as possible. The diversity can be represented by an entropy (H) with respect to hidden unit activity. The cost (C) can be the average of the squared weights. Then, to obtain better generalization, the ratio (H/C) must be as large as possible. We formulated a learning rule to maximize this ratio and applied it to the identification of frequencies. The results confirmed that the ratio of entropy to the cost was increased and the generalization performance was greatly improved. In addition, the learning time was significantly improved by using the entropy method.


1  Introduction 

Neural networks can create internal representations, or hidden unit activity patterns, in the course of
learning. The information contained in input patterns is recoded in distributed ways, that is, represented over
many hidden units. Because of the distributed representation, the networks can appropriately generalize
to novel situations [4]. Thus, the improvement of generalization performance depends upon the recoding of
input patterns at the hidden layer. In addition, it is well known that weight decay and weight
pruning contribute significantly to the improvement of generalization [5], [6], [7], [8]. To achieve good
generalization, the number of weights or the strength of weights in networks must be small.

To represent this condition concretely, let us define an entropy function (H) with respect to hidden unit
activities,

$$H = -\sum_i p_i \log p_i,$$

where $p_i$ is a normalized hidden unit activity and the summation is over all the hidden units. If this entropy
is maximized, all the hidden units are equally or uniformly activated. On the other hand, if this entropy is
minimized, only one hidden unit is activated and all the other units are off. To achieve better generalization,
entropy must appropriately be increased, that is, hidden units are activated over many input patterns. Now,
let us formulate the average cost for each hidden unit or weight strength (C) as

$$C = \frac{1}{M} \sum_{i=1}^{M} C_i,$$

where $C_i$ is the sum of the squared weights into the ith hidden unit and M is the number of hidden units. This
average cost must be as small as possible and entropy for hidden units must be as large as possible. Thus,
if the ratio (G)

$$G = \frac{H}{C}$$

is maximized, the generalization will be improved.

In  the  following  sections,  we  formulate  the  maximization  method  of  the  ratio  of  entropy  to  the  cost. 
Then,  we  apply  the  method  to  the  identification  of  frequencies.  We  show  that  the  ratio  of  entropy  to  the 
cost  is  significantly  increased  and  the  generalization  is  greatly  improved  in  all  cases.  Finally,  the  comparison 
with  the  simple  weight  decay  method  is  presented. 


2  Theory  and  Computational  Methods 


2.1 Entropy Function

Let us formulate the entropy maximization for the improvement of generalization performance. Suppose
that a network is composed of three layers: input, hidden and output layers. Hidden unit activities are
denoted by $v_i$ and input terminals by $\xi_j$. Connections from inputs to hidden units are denoted by $w_{ij}$
and connections from hidden units to output units are denoted by $W_{ij}$.

A hidden unit produces an output

$$v_i = f(u_i),$$

where $u_i$ is the net input to the ith hidden unit, defined by

$$u_i = \sum_{j=1}^{L} w_{ij} \xi_j,$$

where $\xi_j$ is the jth element of an input pattern and L is the number of elements in the pattern. An entropy
function at the hidden layer is defined by

$$H = -\sum_{i=1}^{M} p_i \log p_i, \qquad (1)$$

where

$$p_i = \frac{v_i}{\sum_m v_m},$$

where the summation is over all the hidden units.
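The entropy of the hidden layer can be illustrated numerically. The following is a minimal sketch, assuming that the normalized activity is $p_i = v_i / \sum_m v_m$ with non-negative activities; the function name `hidden_entropy` is hypothetical and introduced here only for illustration.

```python
import numpy as np

def hidden_entropy(v):
    """Entropy H = -sum_i p_i log p_i of normalized hidden activities.

    Assumes non-negative activities v_i and p_i = v_i / sum_m v_m.
    """
    v = np.asarray(v, dtype=float)
    p = v / v.sum()
    nz = p > 0  # terms with p_i = 0 contribute nothing (0 * log 0 := 0)
    return float(-(p[nz] * np.log(p[nz])).sum())

M = 9
uniform = np.full(M, 0.5)   # all hidden units equally active
one_hot = np.zeros(M)
one_hot[0] = 1.0            # a single active hidden unit

print(hidden_entropy(uniform), np.log(M))  # entropy reaches its maximum log M
print(hidden_entropy(one_hot))             # entropy reaches its minimum
```

The two extreme cases correspond to the situations described in the Introduction: uniform activation gives the maximum entropy log M, and a single active unit gives the minimum.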


2.2  Input-hidden  Connections 

The cost can be measured by the total sum of the squared weights. Thus, the cost for the ith hidden unit is
defined by

$$C_i = \sum_{j=1}^{L} w_{ij}^2,$$

where the summation is over all the input units (L input units). Thus, the average cost per unit is computed
by

$$C = \frac{1}{M} \sum_{i=1}^{M} C_i.$$

A function to be maximized is the ratio of the information entropy to the average cost:

$$G = \frac{H}{C} = \frac{-M \sum_i p_i \log p_i}{\sum_i C_i}. \qquad (2)$$
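The ratio of entropy to average cost can be sketched in a few lines. This is an illustrative example only, assuming hidden activities `v` of length M and an input-hidden weight matrix `w` of shape M x L; the function name and the random test values are hypothetical.

```python
import numpy as np

def entropy_cost_ratio(v, w):
    """G = H / C for hidden activities v (length M) and weights w (M x L)."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    p = v / v.sum()                      # normalized hidden activities
    nz = p > 0
    H = -(p[nz] * np.log(p[nz])).sum()   # entropy of the hidden layer
    C_i = (w ** 2).sum(axis=1)           # cost of each hidden unit
    C = C_i.mean()                       # average cost per hidden unit
    return float(H / C)

rng = np.random.default_rng(0)
v = rng.uniform(0.1, 0.9, size=9)          # illustrative activities
w = rng.uniform(-0.5, 0.5, size=(9, 64))   # illustrative weights

# For fixed activities, halving every weight divides C by four,
# so the ratio G becomes four times larger.
print(entropy_cost_ratio(v, w), entropy_cost_ratio(v, 0.5 * w))
```

In a real network the activities depend on the weights, so shrinking the weights also changes H; the comparison above holds the activities fixed only to isolate the effect of the cost term.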


By differentiating both sides of this equation, we have
dwij  ~  E<  E,  *5, 

~2MJ2iPilQgpi 


where  4%  u  defined  by 

Update  rule  is  formulated  as  follows; 


Awij 


^dG  dE 

diVij  ^  dv>ij 

1  H 


where  a  =  cM  and  Si  is  the  ordinary  delta  for  the  back-propagation. 


(3) 
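As a cross-check, the derivative of G can be worked out directly from the definitions by the quotient rule. The following is a sketch under the assumptions that $p_i = v_i / \sum_m v_m$, $v_i = f(u_i)$ and $C = \frac{1}{M}\sum_i C_i$; the shorthand $S = \sum_m v_m$ is introduced here for convenience and is not notation from the text, and the result is not claimed to match the printed form term by term.

```latex
\frac{\partial C}{\partial w_{ij}} = \frac{2 w_{ij}}{M},
\qquad
\frac{\partial H}{\partial v_i} = -\frac{1}{S}\left(\log p_i + H\right),
\qquad S = \sum_m v_m,

\frac{\partial G}{\partial w_{ij}}
  = \frac{1}{C}\,\frac{\partial H}{\partial w_{ij}}
    - \frac{H}{C^{2}}\,\frac{\partial C}{\partial w_{ij}}
  = -\frac{\log p_i + H}{C\,S}\, f'(u_i)\,\xi_j
    \;-\; \frac{2H}{M C^{2}}\, w_{ij}.
```

The first term raises the activity of hidden units whose normalized activity is below the level implied by the current entropy, and the second term shrinks the weights in proportion to their size, much like a weight decay term whose strength depends on H and C.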


2.3  Hidden-Output  Connections 

Let us formulate a cost for the hidden-output connections. A cost for the ith hidden unit is formulated by

$$D_i = \sum_{j=1}^{N} W_{ij}^2,$$

where $W_{ij}$ is a hidden-output connection and the summation is over all the output units (N output units).
Thus, an average cost is

$$D = \frac{1}{M} \sum_{i=1}^{M} D_i = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{N} W_{ij}^2.$$

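A function to be maximized is defined by

$$F = \frac{H}{D}. \qquad (4)$$

Differentiating both sides of this equation with respect to $W_{ij}$, we have

$$\frac{\partial F}{\partial W_{ij}} = -\frac{2H}{M D^2}\, W_{ij}. \qquad (5)$$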


3  Results  and  Discussion 

3.1  Identification  of  frequencies 

Table 1: Summary of experiments for network with nine hidden units.

We applied our method to the identification of frequencies [8]. Networks must identify three frequencies of
sine waves with phase shifts different from those of the training data sets. First, training data were divided
into three classes with three different frequencies (2, 4, 6). Each class has ten exemplars with sixty-four
samples from sine waves. Thus, the number of input units was sixty-four. The number of output units was
three and specific target values were assigned to each output unit, according to the frequency of a class. Let
us take an example of frequency 2. The class of frequency 2 has a target output (0,0,1) and inputs ($\xi_i$) are
given by

$$\xi_i = A \sin\left(2i\,\frac{2\pi}{64} + p\right),$$

where $i = 1, 2, \ldots, 64$, $A = 0.8$ or $1.0$, and $p = 2j\pi/5$, $j = 1, 2, 3, 4$. The total number of training data was thirty
exemplars. Following the specification of Sietsma et al. [8], test data was made with phase shift $p = (2j-1)\pi/5$,
$j = 1, 2, 3, 4$. However, the amplitudes were set to 0.7, 0.9 and 1.1. Thus, the total number of test sets was
forty-five, compared with fifteen of Sietsma's original data. In addition, input sine waves were modified by
using the sigmoid function. Experiments were performed as follows. First, only input-hidden connections
were updated so as to increase entropy. Only when the generalization performance was not significantly
improved were hidden-output connections also used, in addition to the update of input-hidden connections, to
increase entropy. This is because the update of input-hidden connections alone was very stable, while the update of
all the connections tended to be unstable. The same experiments were performed several times with different
initial values for the weights, ranging between -0.5 and 0.5. The learning was considered to be finished when the
absolute differences between targets and outputs were all below 0.2. Though the final Hamming distance and
the sum of squared errors were dependent upon the chosen initial values, approximately the same improvement
(relative improvement) in the generalization performance was obtained. Thus, in the following sections, only
one typical result is given for each experiment.
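The construction of the data sets can be sketched as follows. This is a hedged illustration, not the original experimental code: it assumes that frequency f means f full cycles over the sixty-four samples, that the targets for frequencies 4 and 6 are the remaining one-hot vectors, and that five phase values (j = 1, ..., 5) are used per amplitude, which reproduces the quoted totals of thirty and forty-five patterns. The sigmoid modification of the inputs is omitted because its exact form is not specified.

```python
import numpy as np

# Target for frequency 2 is taken from the text; those for 4 and 6 are assumed.
TARGETS = {2: (0, 0, 1), 4: (0, 1, 0), 6: (1, 0, 0)}

def make_patterns(amplitudes, phases, n_samples=64):
    """Build sine-wave inputs and one-hot targets for the three frequency classes."""
    i = np.arange(1, n_samples + 1)
    inputs, targets = [], []
    for freq, target in TARGETS.items():
        for amp in amplitudes:
            for p in phases:
                # Assumption: frequency f means f full cycles over n_samples points.
                inputs.append(amp * np.sin(2 * np.pi * freq * i / n_samples + p))
                targets.append(target)
    return np.array(inputs), np.array(targets)

# Assumption: five phase values (j = 1..5) per amplitude.
js = np.arange(1, 6)
train_x, train_t = make_patterns((0.8, 1.0), 2 * js * np.pi / 5)
test_x, test_t = make_patterns((0.7, 0.9, 1.1), (2 * js - 1) * np.pi / 5)
print(train_x.shape, test_x.shape)
```

With these assumptions the training set holds 3 x 2 x 5 = 30 patterns and the test set 3 x 3 x 5 = 45 patterns, each with sixty-four input values.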

3.2  Generalization  Performance 

Table 1 shows the summary of experimental results when the number of hidden units was nine. Only
input-hidden connections were used to increase entropy. To facilitate the adjustment of the parameter α, its
values were divided by the maximum entropy, log M. HD means the Hamming distance between
targets and outputs, averaged over all the input patterns and all the elements in the patterns, and ranges
between zero and one. Outputs greater than 0.5 were set to one, while outputs less than or equal to 0.5 were
set to zero. One epoch means thirty presentations of input patterns. As can be seen in the table, the ratio of
entropy to the average cost (G) increases significantly from 3.16 to 3.79. The average Hamming distance between
targets and outputs decreases from 4.44 to zero, meaning that the network can produce the targets perfectly. In
addition, the number of epochs needed to finish the learning decreases as the parameter α decreases.
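The Hamming distance and the stopping criterion described above can be written down directly. The sketch below is illustrative; the function names are hypothetical, and the example data (45 test patterns with 3 outputs each, of which 6 outputs are wrong) are made up to show the arithmetic.

```python
import numpy as np

def hamming_distance(outputs, targets, threshold=0.5):
    """Fraction of output elements that differ from the targets after thresholding."""
    binary = (np.asarray(outputs, dtype=float) > threshold).astype(int)
    return float(np.mean(binary != np.asarray(targets).astype(int)))

def learning_finished(outputs, targets, tol=0.2):
    """True when every absolute difference between target and output is below tol."""
    diff = np.abs(np.asarray(outputs, dtype=float) - np.asarray(targets, dtype=float))
    return bool(np.all(diff < tol))

# Made-up example: 45 patterns x 3 outputs, with 6 output elements flipped.
targets = np.tile(np.array([[0, 0, 1]]), (45, 1))
outputs = targets.astype(float)
outputs[:6, 0] = 0.9   # six wrong elements

print(100 * hamming_distance(outputs, targets))  # percentage of wrong elements
```

Six wrong elements out of 135 give 6/135, about 4.44 percent, so the tabulated distances (4.44, 2.22, 1.48, ...) would be consistent with percentages over 135 output elements. This reading is an inference from the numbers and is not stated in the text.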

Then, we increased the number of hidden units from nine to fifteen. Table 2 shows experimental results
when the number of hidden units was fifteen and only input-hidden connections were used to maximize
entropy. As the parameter α increases, the ratio of entropy to the cost increases significantly from 5.74 to
8.66 for input-hidden connections and from 11.03 to 18.01 for hidden-output connections. The average Hamming
distance decreases from 4.44 to 2.22. Then, we used all the connections to increase the entropy. Table 3
shows experimental results when all the connections were used to increase the entropy. As can be seen in
the table, the Hamming distance decreases greatly from 4.44 to zero, meaning that networks can produce
targets perfectly. In addition, we can see that the number of epochs for finishing the learning is significantly
decreased from 237 epochs (by standard back-propagation) to 119 epochs at the end.

Table 2: Summary of experiments for networks with fifteen hidden units only with updates of input-hidden connections.

Table 3: Summary of experiments for networks with fifteen hidden units with updates of all the connections to increase entropy.

The number of hidden units was then increased from fifteen to twenty-one. In this case, we can
also see a significant increase in the generalization performance by entropy maximization. Table 4 shows
results when the number of hidden units was twenty-one with updates of input-hidden connections to increase
the entropy. As the parameter α increases, the ratios of entropy to the average cost significantly increase.
The Hamming distance decreases from 10.37 to 3.70. The number of training epochs decreases from 157 to
135 epochs. Then, we used all the connections to decrease the Hamming distance. As can be seen in Table
5, computed with all the connections, the ratio (G) increases, while the ratio (F) does not increase. However,
the Hamming distance finally decreases and reaches the level of 2.22. The number of epochs also decreases
significantly from 157 (by standard back-propagation) to 104 epochs.

3.3  Simple  Weight  Decay 

It is well known that adding a simple weight decay term increases the generalization performance
[6]. Let us compare the performance of the entropy method with that of the simple weight decay method. The
weight decay method can be formulated as

where the decay applies to all the connections, including hidden-output connections. Table 6 shows experimental
results obtained by using the weight decay method. As can be seen in the table, the minimum Hamming distance (1.48)
is larger than the minimum (0) obtained by the entropy method (see Table 1). In addition, the number of epochs
needed for the learning increases significantly as the parameter λ increases. Even if the sum squared
error (SSE) decreases below the minimum level (2.06) obtained by the entropy method, the Hamming
distance cannot decrease to the minimum level (zero) obtained by the entropy method. As shown in the table, the
ratios (G) and (F) increase greatly as the parameter increases. Thus, the weight decay method is only a
method to increase the ratio with fixed entropy values.
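For comparison, a simple weight decay step can be sketched as below. The form used here, a gradient step plus a shrinkage of every weight by a fraction `lam`, is a common textbook variant and is an assumption; the exact expression used in the experiments may differ. The names `weight_decay_step`, `lr` and `lam` are hypothetical.

```python
import numpy as np

def weight_decay_step(w, grad_E, lr=0.1, lam=0.01):
    """One update: gradient descent on the error E plus simple weight decay.

    Assumed form: w <- w - lr * dE/dw - lam * w.
    """
    w = np.asarray(w, dtype=float)
    return w - lr * np.asarray(grad_E, dtype=float) - lam * w

w = np.array([[0.5, -0.4], [0.2, 0.1]])
no_gradient = np.zeros_like(w)

# With a zero error gradient every weight shrinks by the factor (1 - lam),
# so the sum of squared weights (the cost) can only decrease.
w_next = weight_decay_step(w, no_gradient)
print((w ** 2).sum(), (w_next ** 2).sum())
```

Unlike the entropy method, this update contains no term that depends on the hidden unit activities, so it reduces the cost without explicitly spreading activity over the hidden units.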


Table 4: Summary of experiments for networks with twenty-one hidden units only with updates of input-hidden connections to increase entropy.

Table 5: Summary of experiments for network with twenty-one hidden units with update of all the connections.

Table 6: Summary of experiments for networks with nine hidden units with a simple weight decay method.
4 Conclusion

In this paper, we have proposed an entropy maximization method to improve the generalization performance.
To achieve better generalization, the information contained in input patterns must be represented over
many hidden units. The number and the strength of weights must be small. We have formulated
a method to increase the ratio of entropy to the cost (total sum of squared weights). We have shown that the
ratios are significantly increased and the generalization performance is greatly improved. In addition, the
learning time is significantly improved, compared with the weight decay method. By maximizing entropy,
networks try to use as many hidden units as possible to finish the learning.
Abstract

We introduce in this paper the boolean sphere, a visualization of the
solution regions for perceptron learning. Linear threshold elements with n
boolean inputs can be used to compute a subset of the $2^{2^n}$ boolean functions
definable on n inputs. Each one of these linearly separable boolean
functions corresponds to a region on the surface of a sphere defined in
weight space. Perceptron learning can be thought of as the process of
examining the solution regions on the boolean sphere. The case of two
boolean inputs provides a nice graphical illustration of perceptron learning
and its convergence. Moreover, symmetry considerations show why
perceptrons are more easily trained using bipolar than binary vectors.
The boolean sphere allows us to give some estimates of the complexity of
learning problems.


1  Perceptrons  and  linear  separability 

Linear threshold elements have attracted attention in recent years as building
blocks for artificial neural networks and it is in this context that they are
called perceptrons. Their properties have been extensively studied and there
exist learning algorithms to train them efficiently. A perceptron with n inputs
$x_1, x_2, \ldots, x_n$ and n + 1 associated weights $w_1, w_2, \ldots, w_{n+1}$ outputs a one if

$$x_1 w_1 + \cdots + x_n w_n + w_{n+1} \geq 0$$

and a zero otherwise. In our notation $-w_{n+1}$
is what is normally called the threshold of the perceptron. Boolean functions
of n arguments computable by a perceptron with n + 1 parameters are called
linearly separable. It is well known that of the $2^{2^n}$ possible boolean functions of
n variables only a vanishing percentage is computable by a perceptron
when n goes to infinity [1]. This fact can be related to the number of regions
delimited in weight space by the hyperplanes defined by the input vectors in the
training set. By defining the boolean sphere we can attempt to actually measure
the relative sizes of each region. This allows us to make certain predictions
regarding the learning difficulties of perceptrons and perceptron networks. Before
doing this we have to take a closer look at the solution regions for perceptron
learning.

Let us suppose that we want to find the perceptron's parameters for the
computation of a given boolean function f of n arguments. There are $2^n$ different
possible inputs which can be enumerated. Let us call $f_i$ the value of f for the
i-th input vector. The n + 1 parameters must satisfy, for each of the $2^n$ possible inputs,
the inequality

$$x_1^i w_1 + x_2^i w_2 + \cdots + x_n^i w_n + w_{n+1} > 0$$

if $f_i = 1$, or the inequality

$$x_1^i w_1 + x_2^i w_2 + \cdots + x_n^i w_n + w_{n+1} < 0$$

if $f_i = 0$. Here $x_j^i$ stands for the j-th bit of the i-th input vector. A learning
algorithm should be able to find the n + 1 necessary weights if they exist.

We can visualize this problem either in the n-dimensional input space as the
question of separating the positive from the negative examples with a hyperplane,
or we can visualize it in weight space as the problem of finding a point
$(w_1, w_2, \ldots, w_{n+1})$ which fulfills the above $2^n$ inequalities. The first alternative
is the traditional approach used in most textbooks. The second alternative is
more interesting because it allows us to look at the inner workings of training
algorithms.

Each one of the inequalities referred to above represents a cut through the
origin with a hyperplane of dimension n of the (n + 1)-dimensional weight space
corresponding to the i-th input $x_1^i, \ldots, x_n^i$. Weight space is thus divided into a
positive and a negative halfspace. Weight combinations $(w_1, w_2, \ldots, w_{n+1})$ in
the positive halfspace produce the perceptron output $f_i = 1$ for the i-th input.
Weight combinations in the negative halfspace produce the perceptron output
$f_i = 0$ for the same input.

Perceptron learning amounts to finding a point $(w_1, w_2, \ldots, w_{n+1})$ in weight
space which lies in the positive halfspace of the i-th cut whenever $f_i = 1$ and
in the negative halfspace whenever $f_i = 0$. This is the dual view of perceptron
learning, in which learning amounts to finding an interior point in an intersection
of halfspaces defined by all possible inputs to the perceptron. Since
the boundaries of the solution regions are hyperplanes, the learning problem
amounts to finding an interior point of a polytope defined by linear constraints.
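The dual view can be made concrete with classical perceptron learning on the extended input vectors $(x_1, \ldots, x_n, 1)$. The following is a minimal sketch, not taken from the text: the function names are hypothetical, the update rule is the standard one that adds or subtracts a misclassified input, and a weight vector is accepted only when it satisfies every inequality strictly, that is, when it lies in the interior of the solution region.

```python
import itertools
import numpy as np

def perceptron_learn(f, n, max_epochs=100):
    """Search for (w_1, ..., w_{n+1}) realizing boolean function f on n binary inputs.

    Returns the weight vector, or None if no solution is found in max_epochs.
    """
    # Extended inputs: the constant 1 multiplies the threshold weight w_{n+1}.
    X = np.array([list(x) + [1] for x in itertools.product([0, 1], repeat=n)],
                 dtype=float)
    t = np.array([f(*(int(b) for b in x[:n])) for x in X])
    w = np.zeros(n + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            s = w @ x
            satisfied = s > 0 if target == 1 else s < 0
            if not satisfied:
                w += x if target == 1 else -x   # move toward the correct halfspace
                errors += 1
        if errors == 0:
            return w
    return None

def realizes(w, f, n):
    """Check that w fulfills all 2^n strict inequalities for f."""
    for x in itertools.product([0, 1], repeat=n):
        s = float(np.dot(w[:n], x) + w[n])
        if (f(*x) == 1 and not s > 0) or (f(*x) == 0 and not s < 0):
            return False
    return True

w_and = perceptron_learn(lambda a, b: a & b, 2)
print(w_and, realizes(w_and, lambda a, b: a & b, 2))
print(perceptron_learn(lambda a, b: a ^ b, 2))   # XOR has no solution region
```

For a linearly separable function such as AND the search ends inside the corresponding polytope, while for XOR the intersection of halfspaces is empty and the search gives up after `max_epochs` passes.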

Each one of the regions defined by the intersection of halfspaces is a solution
region for a linearly separable boolean function. Not all solution regions have the
same shape and an interesting problem is the relative volume of each one. Since
perceptron learning normally starts at a point chosen at random and proceeds
looking for the interior of a specific solution region, it is intuitively clear that
the smaller this region, the harder learning should be. We discuss this problem
in the next section.

Figure 1: The boolean sphere in three dimensions


2  The  boolean  sphere 

One  way  of  looking  at  the  regions  defined  by  m  hyperplane  cuts  going  through 
the  origin  in  an  n-dimensional  space,  is  by  requiring  that  the  weight  vectors  for 
our  perceptron  be  normalized.  This  does  not  affect  the  perceptron  computation 
and  is  equivalent  to  the  condition  that  the  tip  of  all  weight  vectors  should  end 
at  the  unit  hypersphere  of  dimension  n.  In  this  way  all  the  convex  regions 
produced  by  the  m  hyperplane  cuts  define  solution  regions  on  the  ‘surface’  of 
the  hypersphere.  Figure  1  shows  an  example  for  the  case  of  the  perceptron  with 
two  binary  inputs.  Since  the  perceptron  uses  three  parameters,  weight  space 
also  has  this  dimension.  Note  that  four  possible  binary  inputs  define  four  cuts, 
but  four  cuts  in  the  three  dimensional  space  define  only  14  different  regions. 
This means that only 14 of the 16 possible boolean functions of two arguments
can  be  computed  by  this  perceptron. 
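The count of 14 regions agrees with the standard counting formula for m hyperplanes through the origin in general position in d-dimensional space. This formula is a known combinatorial result quoted here as a cross-check, not something derived in the text; general position holds in this case because any three of the four extended input vectors (0,0,1), (0,1,1), (1,0,1), (1,1,1) are linearly independent.

```latex
R(m, d) = 2 \sum_{k=0}^{d-1} \binom{m-1}{k},
\qquad
R(4, 3) = 2\left[\binom{3}{0} + \binom{3}{1} + \binom{3}{2}\right] = 2\,(1 + 3 + 3) = 14 .
```

The two boolean functions of two arguments that have no region are XOR and its complement, the only ones that are not linearly separable.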

Figure 1 immediately leads to a conjecture. Since the relative sizes of the
solution regions on the boolean sphere represent how difficult it is to learn them,
and since our learning algorithm will be asked to learn one of these functions
randomly, the best strategy is to try to get regions of about the same relative
size. Binary input vectors however lead to unsymmetrical cuts in weight space
and the solution regions are of very different size. Symmetrical cuts can be
achieved by substituting the binary input vectors by bipolar ones and training
the perceptron under this coding. Table 1 was calculated using a Monte Carlo
method. A normed weight vector was generated randomly and its associated
boolean function was computed. By repeating the experiment many times it
was possible to calculate the relative volumes of the solution regions. The table
shows that the maximum variation in the relative sizes of the 14 possible regions
is given by a factor of 1.33 when bipolar coding is used, whereas in the binary
case it is about 12.5. This means that with binary coding some regions are
almost an order of magnitude smaller than others. And indeed it has been
empirically observed that multilayer neural networks are easier to train using a
bipolar representation than a binary one. The rationale for this fact is given by
the size of the regions in the boolean sphere. It is also possible to show that
bipolar coding is optimal under this criterion.

Table 1: Relative sizes of the regions on the boolean sphere

Figure 2: Solution regions on the surface of the boolean sphere
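The Monte Carlo estimate of the region sizes can be sketched as follows. This is an illustration under stated assumptions, not the original procedure: the weights are drawn uniformly from the cube $[-1, 1]^3$ (normalizing them afterwards would not change the computed function), the output is one when the weighted sum is non-negative, and the function name `region_fractions` is hypothetical. Each boolean function is encoded as an integer whose bits are the outputs for the four inputs.

```python
import numpy as np

def region_fractions(inputs, n_samples=200_000, seed=0):
    """Estimate the relative size of each solution region for two-input perceptrons."""
    rng = np.random.default_rng(seed)
    # Assumption: weights sampled uniformly in the cube [-1, 1]^3.
    w = rng.uniform(-1.0, 1.0, size=(n_samples, 3))
    # Extended inputs: a constant 1 multiplies the threshold weight.
    x = np.hstack([inputs, np.ones((len(inputs), 1))])
    bits = (w @ x.T >= 0).astype(int)             # output for each of the 4 inputs
    codes = bits @ (1 << np.arange(len(inputs)))  # one integer per boolean function
    values, counts = np.unique(codes, return_counts=True)
    return dict(zip(values.tolist(), (counts / n_samples).tolist()))

binary = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
bipolar = 2 * binary - 1

for name, coding in (("binary", binary), ("bipolar", bipolar)):
    frac = region_fractions(coding)
    ratio = max(frac.values()) / min(frac.values())
    print(name, len(frac), "regions, max/min size ratio", round(ratio, 2))
```

Under the cube-sampling assumption the bipolar regions form six larger and eight smaller cones whose volumes are in the ratio 4/3, which is in line with the factor of 1.33 quoted above; sampling the weights uniformly on the sphere instead would give a different ratio. In both codings the codes for XOR and its complement never occur.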

We can go one step further and ask what is the shape of the error function
for perceptrons in weight space. Given a boolean function f and a weight vector
$w_1, w_2, \ldots, w_{n+1}$ the error function is the number of misclassified input vectors.
We can visualize this function for the case of a perceptron with two inputs
by projecting the solution regions on the three dimensional boolean sphere in
weight space on a plane using an adequate transformation. Figure 2 shows a
stylized representation of the result of using a stereographic projection. The
great circles on the boolean sphere are projected to ellipses. Since we are only
interested in the number of regions and the topological relation between them,
we can think of the projection of the great circles of the boolean sphere as circles
in the plane. The result is Figure 2, which also shows which boolean functions
of two variables are computed in each region.

Figure 3: The error function on the surface of the boolean sphere

If  we  are  looking  for  a  weight  vector  to  compute  the  zero  function,  the  shape 
of  the  error  function  is  the  one  shown  in  Figure  3.  It  can  be  seen  that  there  is 
a  single  region  in  which  the  error  function  reaches  its  maximum,  and  a  single 
region  in  which  it  reaches  its  minimum.  Since  there  are  no  local  minima,  any 
greedy  procedure,  like  classical  perceptron  learning,  converges  to  a  solution  in 
weight  space  whenever  this  exists. 

It  should  be  clear  from  the  figure  that  these  conclusions  do  not  depend  on 
the  specific  function  selected  for  the  training  process.  This  visualization  of  the 
error  function  is  superior  to  the  one  used  in  [3]. 

In the case of multilayer networks we can also visualize the form of the error
function, as shown in Figure 4. In this case a network of two hidden units
has been used. The boolean functions of two arguments have been enumerated
from $f_0$ to $f_{15}$. The table shows which function is computed by the first hidden
unit (rows) and by the second (columns). It is shown at each place in the table
thus defined which function is computed by the output unit. One can think of
this diagram as the surface of the boolean sphere associated with the
nine-dimensional weight space of the three unit network used to compute XOR. All
solutions for the XOR task are shown in the diagram, and one can see the high
symmetry of the solution regions. They are uniformly distributed in weight
space, although the volume of each region differs.

Figure 4: Regions in the boolean sphere for the XOR problem

3 Conclusions

The relevance of the boolean sphere lies not in its practical application to
perceptron or neural network learning, but in its power as a visualization for
understanding how learning algorithms work and why they sometimes fail. The boolean
sphere allows us to measure the difficulty of a learning task by relating it to the
volume of its solution region in the boolean sphere. With an estimate for this
volume it is possible to give an upper bound for stochastic learning of boolean
functions. It is easy to compute, for example, that a solution for the XOR
problem can be found with probability 0.001 using a network with two hidden
neurons and trying around 1800 different weight combinations. This would be
uninteresting, were it not for the fact that some authors normally report still
larger iterative cycles using fancy algorithms. The boolean sphere is thus a
conceptual instrument which allows us to understand neural networks better.
As a matter of fact, it has been used implicitly in the Karnaugh maps of logical
functions. Simplification of logic expressions amounts in this formalism to the
elimination of superfluous hyperplane cuts.
Conventional methods of supervised learning are inevitably faced with the problem of
local minima; evidence is presented that conjugate gradient and quasi-Newton
techniques are particularly susceptible to being trapped in sub-optimal solutions. A new
classical technique is presented which by the use of a homotopy on the range of the
target outputs allows supervised learning methods to find a global minimum of the
error function in almost every case.


Introduction 

The problems to which neural computing techniques are most frequently applied involve the
supervised learning of an input-output mapping, defined implicitly by a set of P input patterns
together with their desired outputs. Such tasks can be formulated as error-minimisation
problems, where the error function is usually given by

$$E = \frac{1}{2} \sum_{p=1}^{P} E_p = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N} (d_{i,p} - z_{i,p})^2 \qquad (1)$$

where $d_{i,p}$ and $z_{i,p}$ are the desired and actual values of the ith output unit for pattern p, for a
network with N output units. E is a function of all the parameters (weights and thresholds) of the
network; this parameter list can be written as a multidimensional vector w. The problem is to
change w so as to avoid those solutions of the minimisation condition $\partial E / \partial w = 0$ which do not
correspond to the lowest value of E, the local minima of the error-weight surface. The most
commonly used supervised training technique, error backpropagation (BP) (equivalent to
gradient descent with a fixed step length) is well known to have difficulties with local minima,
especially for non linearly separable problems [1]. What is less well known is that the neural
implementations of more efficient classical minimisation algorithms, such as conjugate
gradients (CG) or the quasi-Newton method (QN), are even more likely to be trapped in
sub-optimal solutions. Table 1 shows the percentage success in reaching a global minimum for 100
(2-2-1) networks learning to solve the XOR problem.
XOR is a useful benchmark because it is a non linearly separable problem with known local
minima [2], but one which can be solved by a small network with only 9 adaptive weights.
Linear outputs in the second layer (as opposed to sigmoidal squashing for both computational
layers) improve the percentage success, but there is a clear trend toward worsened performance
for the more sophisticated algorithms. Simple on-line BP (without momentum) performs best;
this may be due to the method's stochastic features, as discussed in [3].

Trapping in local minima can also be observed for continuous function learning problems.
McInerney et al. [4] have discovered (by exhaustive search of the error-weight surface) local
minima in a (1-2-1) network (with a linear output node) learning the sine function. This problem
was also investigated, using the same training set as in [4], and the results are summarised in
Table 2.

These results do not show as high a probability of trapping in local minima as in the XOR
example, but there is still a significant correlation between the probability of failure and the
convergence speed of the method; the quasi-Newton method, with a 13% failure rate, would
probably not be a good choice unless multiple restarts were acceptable.

It is commonly believed - though we do not know of any 'no-go theorem' to this effect - that the
only techniques guaranteed to converge to a global minimum with a probability approaching 1
are stochastic in character, with methods based on simulated annealing [5] and, currently,
genetic algorithms [6] being among the most popular. However these techniques can be very
slow and must be applied carefully in order to ensure a good solution. Is there a way to retain
the fast convergence of techniques like conjugate gradients and the quasi-Newton method whilst
improving the robustness of these algorithms in the face of local minima? We will present here
a new and purely classical method which is guaranteed to succeed in avoiding local minima in
almost all cases.


Expanded  range  approximation  (ERA) 

The basic idea underpinning this new algorithm is that of a homotopy on the range of the target
values $d_p$ (for simplicity we consider just one output node). This range is modified by
compressing these values down to their mean value $\langle d \rangle = \frac{1}{P}\sum_{p=1}^{P} d_p$
and then progressively expanding these compressed targets back toward their original values
(hence the epithet 'expanded range approximation', or ERA, we have coined for this approach).
We define a modified training set

$$S(\lambda) = \{x_p, d_p(\lambda)\} = \{x_p, \langle d \rangle + \lambda (d_p - \langle d \rangle)\}$$

where the $d_p(\lambda)$ are the new, compressed, targets. The problem defined by $S(0)$ is easy for the
network to solve (the corresponding error-weight surface can be shown to have only a global
minimum); $S(1)$ is the original problem with training set $\{x_p, d_p\}$. The homotopy parameter $\lambda$
interpolates between these extremes. A $\lambda$-parametrised error function can be defined during
training on each of the sets $S(\lambda)$ by

$$E(\lambda) = \frac{1}{2} \sum_{p=1}^{P} \left[ \langle d \rangle + \lambda (d_p - \langle d \rangle) - z_p(\lambda) \right]^2 \qquad (2)$$

where the $z_p(\lambda)$ are the actual network outputs during this procedure. Setting $\lambda = 1$ gives
$E(1) \equiv E$, the error function (1) in the case of a single output node. The ERA method involves
first solving the problem $S(\lambda_1)$ for small $\lambda_1$, then the problem $S(\lambda_2)$ with $\lambda_2 > \lambda_1$, and so on up
to the original problem $S(1)$. We have usually chosen to increase $\lambda$ by uniform steps of $\eta$; an
'N-step ERA' method refers to the progressive solution of the N problems $S(\lambda_n = n\eta)$,
$n = 1..N = 1/\eta$ ('1-step ERA' ($\eta = 1$) is the conventional single step training technique).
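
The target compression and the uniform stepping schedule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `era_targets`, `era_schedule` and `era_train` are introduced here, and `train_step` is a hypothetical callback standing in for whatever minimiser (CG, QN, BP) is used to solve each sub-problem.

```python
from typing import Callable, List, Sequence


def era_targets(d: Sequence[float], lam: float) -> List[float]:
    """Compressed targets d_p(lam) = <d> + lam * (d_p - <d>)."""
    mean = sum(d) / len(d)
    return [mean + lam * (dp - mean) for dp in d]


def era_schedule(n_steps: int) -> List[float]:
    """Uniform homotopy schedule lam_n = n * eta, with eta = 1 / n_steps."""
    return [n / n_steps for n in range(1, n_steps + 1)]


def era_train(weights, d: Sequence[float], n_steps: int, train_step: Callable):
    """Solve S(lam_1), S(lam_2), ..., S(1) in turn, warm-starting each problem.

    train_step(weights, targets) is a hypothetical user-supplied routine that
    minimises the error for the given targets and returns the updated weights.
    """
    for lam in era_schedule(n_steps):
        weights = train_step(weights, era_targets(d, lam))
    return weights


# XOR targets for the patterns 00, 01, 10, 11
xor_targets = [0.0, 1.0, 1.0, 0.0]
print(era_targets(xor_targets, 0.0))   # fully compressed: every target equals the mean
print(era_targets(xor_targets, 0.1))   # first step of a 10-step ERA run
print(era_schedule(10))
```

With `n_steps = 1` the schedule contains only $\lambda = 1$, which corresponds to the conventional single-step training described in the text.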
As a first example, the ERA method was applied to the same 100 XOR networks (with sigmoidal
output in the final layer) as in Table 1, using the CG algorithm. With 10-step ERA
($\eta = 0.1$), the success rate improves dramatically from 51% when $\eta = 1$ to 94%. Figure 1 shows a
training curve for a particular set of XOR weights which led to a failure when $\eta = 1$ (curve (a)),
but succeeded with 10-step ERA (curve (b)). The error function plotted for the 10-step ERA
case is the square root of $E(1)$, the error with respect to the original, uncompressed targets. The
errors $E(\lambda)$ always decrease during ERA steps, but the overall $E$ can show local increases, as is
evident from Figure 1.


Figure  1  Figure  2 

If the step size $\eta$ is decreased, the percentage success improves still further; 100-step ERA
($\eta = 0.01$) is 100% successful in solving the XOR problem. As a second example, 2-step ERA
($\eta = 0.5$) was applied to the sine problem of Table 2, using the QN method. In this case - a
continuous as opposed to binary problem, a linear as opposed to sigmoidal output in the final layer,
a different training method - there was also a very significant improvement, from 87% success
when $\eta = 1$ to 100% for 2-step ERA.

It might seem that the ERA technique could become computationally expensive if very many
small steps were required. In fact we suspect that it will be possible to make a short cut in many
cases. We have so far observed that an ERA simulation which fails to find a global minimum of
$E(\lambda_1)$ will not subsequently succeed as the homotopy parameter $\lambda \to 1$. Conversely, however,
in our experience a simulation which succeeds at the first step never subsequently fails. This
suggests that it is most important to get the first step right, and that subsequent range expansion
does not need to be done so carefully. Figure 2 shows the percentage of successes at the first
step (successful minimisations of $E(\lambda_1 = \eta)$) as a function of the size of $\eta$ for the same 100 XOR
networks used in the single step (conventional) and 10-step tests. The dependence appears to be
roughly linear, with, in this case, 100% success at $\eta = 0.01$.

All the initial simulations suggested a special role for $\eta$, the size of the first step. In order to try
to get some further insight into the process, we looked at the trajectories in output space
followed for the XOR problem by the $z_p(\lambda_1 = \eta)$, the first-step responses to the four patterns p =
00, 01, 10, 11. Since the initial weights are randomly chosen (from the interval [-1,1]) the
trajectories in these experiments begin at some arbitrary point inside the hypercube $[0,1]^4$. The
target for $\eta = 1$ is the point (0,1,1,0); the targets for $\eta < 1$ lie on a line joining this point to
(1/2,1/2,1/2,1/2). By taking pairs of these responses we were able to plot trajectories in the six
2-dimensional $(z_{p_1}(\eta), z_{p_2}(\eta))$ subspaces during CG training. Figures 3a-d illustrate trajectories in
$(z_{00}(\eta), z_{10}(\eta))$ space for $\eta = 1.0$ (Figure 3a), $\eta = 0.3$ (3b), $\eta = 0.2$ (3c), $\eta = 0.1$ (3d). A coordinate
transformation

$$(x, y) = \frac{1}{\eta}\left( z_{00} - \tfrac{1}{2}(1-\eta),\; z_{10} - \tfrac{1}{2}(1-\eta) \right)$$

is used in plotting the diagrams so that the scales are identical, and in each case the target is the
top left hand corner. The midpoint on the y-axis represents a local minimum.
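
As a quick check of the plotting convention, the short Python sketch below maps the compressed XOR targets into the plot coordinates. It assumes the transformation reads $(x, y) = \frac{1}{\eta}(z_{00} - \frac{1}{2}(1-\eta),\ z_{10} - \frac{1}{2}(1-\eta))$; the helper names `era_target` and `plot_coords` are introduced here for illustration only.

```python
def era_target(d: float, eta: float, mean: float = 0.5) -> float:
    """Compressed target <d> + eta * (d - <d>); for XOR the mean <d> is 1/2."""
    return mean + eta * (d - mean)


def plot_coords(z00: float, z10: float, eta: float):
    """Rescaled plot coordinates (x, y) for the responses to patterns 00 and 10."""
    shift = 0.5 * (1.0 - eta)
    return ((z00 - shift) / eta, (z10 - shift) / eta)


for eta in (1.0, 0.3, 0.2, 0.1):
    # Targets for pattern 00 (d = 0) and pattern 10 (d = 1) at this step size
    x, y = plot_coords(era_target(0.0, eta), era_target(1.0, eta), eta)
    print(f"eta={eta}: target plotted at ({x:.3f}, {y:.3f})")

# For eta = 1 the local-minimum response (z00, z10) = (0, 1/2) lies on the y-axis
print(plot_coords(0.0, 0.5, 1.0))
```

Under this assumption the target always lands at the top left corner (0, 1) whatever the value of $\eta$, and for $\eta = 1$ the trapped response (0, 1/2) sits at the midpoint of the y-axis, in agreement with the description in the text.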


Figure  3c  Figure  3d 


Figure 3a shows a conventionally trained ($\eta = 1$) network which fails to solve the XOR problem,
becoming trapped in the local minimum corresponding to a final response to the four patterns of
z = (0, 1/2, 1/2, 0). Figure 3b also shows a failure, for $\eta = 0.3$, but notice that the trajectory appears
to almost escape from the local minimum. Figure 3c shows a success at $\eta = 0.2$, but the trajectory
still spends a lot of time in the vicinity of the local minimum before escaping. Finally, Figure
3d, with $\eta = 0.1$, shows a trajectory which entirely avoids the vicinity of the local minimum,
heading more or less directly for the global minimum of $E(\lambda_1 = \eta)$. There appears to be a
progressive change of behaviour as $\eta$ is decreased; this progression is most marked for small values
of $\eta$.

In order to further investigate this progressive change in the first-step behaviour we looked at
the values of the 9 weights which were developed by typical examples of the (2-2-1) XOR
networks after minimisation of the first-step error function $E(\eta)$. In Figures 4a,b the weights
plotted for the 'test number' -1 are the original, randomly chosen weights $\in [-1,1]$, those for test
number 0 represent the solution reached when $\eta = 0$ (when the error-weight surface has only a
global minimum), those for test numbers > 0 the weights $w(\eta)$ developed during the
minimisation of $E(\eta > 0)$ for progressively larger values of $\eta$ (using a logarithmic scale). In Figure 4a,
tests 1, 2, 3, 4 represent $\eta$-values 0.001, 0.01, 0.1, 1.0; in Figure 4b, tests 1, 2, 3 represent
$\eta$-values 0.00001, 0.0001, 0.001.


Figure  4a 


Figure  4b 


Figure 4a shows almost linear relationships between the values of the 9 weights and $\log(\eta)$, but
with a very striking change in behaviour (possibly representing a phase change of the learning
system) at some critical value $\eta_{crit}$ between 0.1 and 1.0. A somewhat closer investigation shows
that in this case $0.1 < \eta_{crit} < 0.2$. We note that this system succeeds in finding the global
minimum of $E(0.1)$, but falls into a local minimum of $E(0.2)$; the change in behaviour in $w(\eta)$
evident in Figure 4a is clearly related to the switch from success to failure as $\eta$ increases which
is illustrated (for another XOR network with $0.2 < \eta_{crit} < 0.3$) in Figures 3a-d. One criticism
that might be levelled at the results shown in Figure 4a is that the weights for small $\eta$ look very
similar to the starting weights - has the network been trained sufficiently in these cases for any
meaningful conclusions to be drawn? By looking more closely at the behaviour for very small
$\eta$, this criticism can be seen to be unfounded. Figure 4b, which uses a different set of starting
weights, shows that in general there is a large difference between the starting weights and the
$\eta = 0$ solution, but thereafter a much smoother progression with increasing $\eta$.


Discussion


This paper has presented results which contradict the widespread belief that the only way to avoid
local minima in supervised learning problems with complex error-weight surfaces is to use
computationally expensive stochastic procedures like simulated annealing or genetic algorithms. If
the results presented here can be shown to be securely founded, and the ERA method shown to
have wide applicability, there could be significant changes in the way that supervised learning
tasks are approached. We believe that it is possible to construct a rigorous mathematical proof
that the ERA method will work in all but pathological (and rare) cases. The details of this proof
are too lengthy to be presented here, but the general principles can be outlined. Initially we
look at the first ERA step, for which the homotopy parameter $0 < \lambda \ll 1$. For such small $\lambda$, the
error $E(\lambda)$ of (2) can be expanded as

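$$E(\lambda) = E(0) + \lambda E_1 + O(\lambda^2)$$

$$E(0) = \frac{1}{2}\sum_{p=1}^{P} \left[ \langle d \rangle - z_p(0) \right]^2, \qquad E_1 = \sum_{p=1}^{P} \left( d_p - \langle d \rangle - z'_p(0) \right)\left( \langle d \rangle - z_p(0) \right)$$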


where the derivative $z'_p(0)$ depends on the particular learning law, but is assumed bounded. It is
possible to show that $E(0)$ has only a global minimum. Then we can investigate the shape of the
surface $E(\lambda)$ by looking at the effect of the additional term $\lambda E_1$ for small $\lambda$. Outside some small
neighbourhood $N_0$ of the global minimum $P_0$ of $E(0)$ the effect of this additional term can be
made arbitrarily small by choosing $\lambda$ small enough. In particular it can be shown that no local
minima can exist outside $N_0$, as


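$$\frac{\partial}{\partial w}\bigl( E(0) + \lambda E_1 \bigr) \neq 0 \quad \text{if} \quad \lambda \left| \frac{\partial E_1}{\partial w} \right| < \left| \frac{\partial E(0)}{\partial w} \right| .$$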

Within $N_0$ the global minimum of $E(0)$ will in general be shifted; there can be degenerate cases
where a number of global minima could arise, but this set would be expected to be of measure
zero. Continued expansion of the homotopy parameter $\lambda \to 1$ is handled in a similar way to the
first step above. There may be obstructions to performing the homotopy up to $\lambda = 1$, but we
expect that such cases will be rare. Further investigations are in progress, and will be reported in
the literature.


References 


[1] M Gori and A Tesi, "On the problem of local minima in backpropagation", IEEE Trans. on
Pattern Analysis and Machine Intelligence, 14, 76-86 (1992).

[2] E K Blum, "Approximation of Boolean functions by sigmoidal networks: Part I: XOR and
other two-variable functions", Neural Computation, 1, 532-540 (1989).

[3] C Darken and J M Moody, "Towards faster stochastic gradient search", in: Advances in
Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, CA, 1009-1016 (1991).

[4] J M McInerney, K G Haines, S Biafore and R Hecht-Nielsen, "Error surfaces of
multilayer networks can have local minima", UCSD Tech. Rep. CS89-157, October 1989.

[5] L Ingber, "Very fast simulated re-annealing", Math. Comput. Modelling, 12, 967-973
(1989).

[6] J Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press,
Ann Arbor, MI (1975).


GLOBALLY  OPTIMAL  NEURAL  LEARNING 
J. Barhen (1,2), A. Fijany (1), and N. Toomarian (1)

California  Institute  of  Technology 
Pasadena, CA 91109

Abstract: One of the fundamental limitations of current artificial neural network learning
paradigms is the susceptibility to local minima during training. Building upon the recently
discovered TRUST global optimization methodology, computationally efficient algorithms are
presented that enable overcoming local minima, both for backpropagation schema in feedforward
multilayer architectures, and for adjoint-operator learning in recurrent networks. Extensions
to TRUST are introduced that now formally guarantee convergence to a global minimum
in the multidimensional case. Results for a standard benchmark are included, to illustrate the
theoretical developments.


Introduction 

Considerable efforts have recently been devoted to the development of efficient
computational methodologies for learning. Even though the primary emphasis has largely
been on error-backpropagation algorithms for "feedforward" [1] architectures, the more
complex "recurrent" networks [2-5] are currently receiving renewed attention. In
particular, the introduction of adjoint-operator formalisms for very fast processing of static [6]
and time-dependent [7] phenomena has opened new avenues in terms of computational
efficiency and relevance to real-time applications. The development of such learning
algorithms is generally based upon the minimization of an energy-like "neuromorphic" function
or functional. The main emphasis of most research to date has been on how to best obtain
the gradients [8] of this function or functional with respect to the various parameters of
the neural architecture.

The susceptibility to local minima during training remains, however, one of the
fundamental limitations of current artificial neural network learning paradigms. Heretofore,
local minimization techniques have provided the main operational tools for implementing
the corresponding algorithms, with the notable exception of stochastic paradigms such as
the "Boltzmann Machine" [9] or "diffusion" processes [10]. This paper presents a new
approach to learning, in which the gradient descent mechanism is replaced by a
methodology based upon a recently developed global optimization scheme, acronymed TRUST
[11]. The deterministic TRUST algorithm formulates global optimization in terms of the
flow of a special nonlinear dynamical system. TRUST can be applied both to
backpropagation learning and to recurrent networks. Originally [11], we could only guarantee
convergence to a global minimum for 1-dimensional problems, even though in practice, for all
n-dimensional applications, a global minimum was always reached. Here, we introduce a
simple extension to TRUST, which now formally guarantees convergence to global minima
in the multidimensional case. We demonstrate, on a couple of standard benchmarks, that
our approach indeed overcomes encountered local minima, and thereby provides a globally
optimal solution to the learning problems under consideration.


(1) Jet Propulsion Laboratory, 4800 Oak Grove Dr., Tel. (818)354-9218, Fax (818)393-5013,
(2) Applied Physics Department


Global Optimization


Let $f(x)$ be a twice continuously differentiable scalar function of the variable $x$ defined
over an interval of interest. Consider the nonlinear transformation

$$E(x, x^*) = \ln \frac{1}{1 + e^{-[\hat f(x) + \varepsilon]}} - \frac{3}{4}\, \rho\, (x - x^*)^{4/3}\, H[\hat f(x)] \qquad (1)$$

where $H$ denotes the Heaviside function, $x^*$ is a fixed value of $x$, the selection of which
will be discussed below, and $\hat f(x) = f(x) - f(x^*)$ represents an offset of $f(x)$ by the amount $f(x^*)$. The
positive parameters $\varepsilon$ and $\rho$ will be specified in the sequel. To highlight the two basic
concepts which underlie the TRUST formalism, we rewrite Eq. (1) as

$$E(x, x^*) = E_{sub}(x, x^*) + E_{\rho}(x, x^*) \qquad (2)$$

The sub-energy tunneling function, $E_{sub}$, is defined as the first term in the right-hand-side
of Eq. (1). It has the same relative ordering of local and global minima as $f(x)$ since

$$\frac{\partial E_{sub}}{\partial x} = 0 \iff \frac{\partial f}{\partial x} = 0;$$

furthermore, $E_{sub}$ is monotonically increasing in terms of both $f(x)$ and $\hat f(x)$. We want
$E_{sub}$ to have the following effects: first, it should suppress $f(x)$ to zero for $f(x) > f(x^*)$;
second, it should leave $f(x)$ nearly unmodified for $f(x) < f(x^*)$. These requirements
determine possible values for the parameter $\varepsilon$. Figure 1 illustrates the action of $E_{sub}$ on
the function f(x) = [sin 2x - x - 1]^ with $x^* \approx -6.80678$ and $\varepsilon = 2$. We see that $E_{sub}$
preserves all properties relevant for optimization. The term $E_{\rho}$ induces a terminal repeller
[12] effect.


The basic idea driving our approach to globally optimal learning is to construct a
dynamical system which switches autonomously between two phases: (1) a tunneling phase,
where the system flows under the action of a terminal repeller over a sub-energy surface
flattened to near zero level for all values of $f(x)$ above a threshold $f(x^*)$; (2) a gradient
descent phase, during which the repeller effect is switched off, and which starts whenever
a lower energy valley has been reached, to obtain a new threshold $f(x^*)$.

For actual neural learning, $f$ is an error function or functional of a parameter vector
$x$ (including, e.g., the synaptic weights, the potential decay constants and the sigmoidal
gains). For the sake of simplicity, the following discussion is limited to one dimension. In
this vein, upon application of gradient descent to the $E$ functional, we obtain

$$\dot x = -\frac{\partial f(x)}{\partial x}\, \frac{1}{1 + e^{\hat f(x) + \varepsilon}} + \rho\, (x - x^*)^{1/3}\, H[\hat f(x)] \qquad (3)$$


We assume that the optimization is to take place in a specified domain, $D$, which, in the
multidimensional case, will be taken as a hyperparallelepiped. In the one-dimensional case,
$D$ is simply the interval $[x_L, x_U]$. To initiate the optimization process, $x^*$ is chosen to be
the lower limit of $D$, i.e., $x^* = x_L$. This point need not be a local minimum. A repeller
is placed at $x^*$, and the dynamical system (3) is given the initial condition $x^* + \delta$, where
$\delta$ is a very small positive perturbation, so that the system flows in the positive direction,
i.e., toward the upper limit of $D$.


The selection of $x^*$ defines a threshold called the zero sub-energy limit, $f(x^*)$, above
which $E_{sub}$ is nearly zero in value and approximately flat. If $f(x^* + \delta) < f(x^*)$, the
algorithm immediately enters a gradient descent phase, in which the state of the dynamical
system flows down the gradient and reaches its first local minimum. At this
point, we set $x^*$ equal to that minimum in (3), and perturb $x$ to $x^* + \delta$. Since the new $x^*$ is a local minimum,
$f(x) > f(x^*)$ in a neighborhood of $x^*$. Consequently, the repelling term is active. The
repeller located at $x^*$ drives the solution across the flattened sub-energy surface, which in
effect pushes the system uphill on the surface of the associated objective function. The
dynamical system remains in the repelling phase until it reaches a lower valley, where
$\hat f(x) < 0$. This phase tunnels through all of the state space region with functional values
that lie above the last found local minimum value, $f(x^*)$. A lower valley of $E_{sub}(x, x^*)$ is a
lower value of $f(x)$ as well. As the dynamical system enters the next valley, the algorithm
automatically switches to the second phase, where the terminal repeller term is zero and
gradient descent takes over, leading to further minimization of $f(x)$. The system will
equilibrate at the next lower local minimum. We set $x^*$ to this minimum and repeat the
process. If $f(x^* + \delta) \geq f(x^*)$ when the optimization procedure is initiated, the dynamical
system will initially enter the tunneling phase. The tunneling will proceed to a lower valley,
at which point the algorithm enters a gradient descent phase and follows the behavior
discussed above.

The two-phase descent-and-tunneling process continues until a suitable stopping
criterion is satisfied. Here, as soon as the lowest local minimum in $D$ has been reached,
the optimization cycle is repeated by placing a repeller at that minimum and perturbing the system
to initiate the next tunneling phase. In this case, the sub-energy transformation flattens
$f(x)$ in the entire domain of interest, since the value of $f$ at that minimum is the lowest objective function value in
$D$. The perturbed dynamical system, which is now in the repeller tunneling phase, will
eventually flow beyond the upper boundary of $D$. Thus, when the state flows out of the
domain boundary, i.e., when $x > x_U$, the last local minimum found is the global minimum
of the function in $D$.
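
The two-phase procedure can be sketched numerically as follows. This is a rough illustration rather than the authors' implementation: it assumes the flow has the form $\dot x = -f'(x)/(1 + e^{\hat f(x) + \varepsilon}) + \rho (x - x^*)^{1/3} H[\hat f(x)]$, integrates it with a plain forward-Euler step, and declares a local minimum when $|f'(x)|$ drops below a tolerance. The function name `trust_1d`, the double-well test function, and all numerical values (`rho`, `delta`, `dt`, `tol`) are assumptions chosen for this example.

```python
import math


def cbrt(v: float) -> float:
    """Real cube root that also accepts negative arguments."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def trust_1d(f, df, x_lo, x_hi, eps=2.0, rho=2.0, delta=1e-3,
             dt=1e-2, tol=1e-8, max_steps=200_000):
    """Forward-Euler sketch of the descent-and-tunneling flow on [x_lo, x_hi].

    Returns the last local minimum found before the state leaves the domain
    (None if no minimum was located within max_steps).
    """
    x_star = x_lo           # repeller position, initially the lower limit of D
    x = x_star + delta      # perturbed initial condition
    last_min = None
    for _ in range(max_steps):
        if x > x_hi:        # state has flowed out of the domain: stop
            break
        f_hat = f(x) - f(x_star)
        # Sub-energy damping factor 1 / (1 + exp(f_hat + eps)); exponent clamped
        damp = 1.0 / (1.0 + math.exp(min(f_hat + eps, 700.0)))
        if f_hat < 0.0:
            # Gradient descent phase: the Heaviside factor switches the repeller off
            g = df(x)
            if abs(g) < tol:
                # Local minimum reached: move the repeller there and perturb
                last_min = x
                x_star = x
                x = x_star + delta
                continue
            x -= dt * g * damp
        else:
            # Tunneling phase: terminal repeller plus the damped gradient term
            x += dt * (-df(x) * damp + rho * cbrt(x - x_star))
    return last_min


# Hypothetical double-well example: local minimum near x = -0.96,
# lower (global) minimum near x = 1.04 on D = [-2, 2].
def f(x):
    return (x * x - 1.0) ** 2 - 0.3 * x


def df(x):
    return 4.0 * x * (x * x - 1.0) - 0.3


x_min = trust_1d(f, df, -2.0, 2.0)
print(x_min, f(x_min))
```

In this example the state first descends into the shallower well, tunnels over the barrier, descends into the deeper well, and finally tunnels out of the upper boundary, at which point the last minimum found is returned. The step size and repeller strength would need retuning for other objective functions.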

Multidimensional  Global  Optimization 

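The one-dimensional algorithm can easily be extended to handle multidimensional
problems. Let $f(\mathbf{x})$ be a function of the $M$-parameter vector $\mathbf{x}$, and define the nonlinear
transformation

$$E(\mathbf{x}, \mathbf{x}^*) = \ln \frac{1}{1 + e^{-[\hat f(\mathbf{x}) + \varepsilon]}} - \frac{3}{4}\, \rho \sum_{i=1}^{M} (x_i - x_i^*)^{4/3}\, H[\hat f(\mathbf{x})] \qquad (4)$$

The multidimensional sub-energy term is analogous to the one-dimensional sub-energy
functional $E_{sub}$. The portions of the objective function surface which lie above the zero
sub-energy limit, $f(\mathbf{x}^*)$, are flattened by the use of the $E_{sub}(\mathbf{x}, \mathbf{x}^*)$ transformation.

Upon application of gradient descent to $E(\mathbf{x}, \mathbf{x}^*)$ in Eq. (4), we obtain the vector
differential equation

$$\dot x_i = -\frac{\partial f(\mathbf{x})}{\partial x_i}\, \frac{1}{1 + e^{\hat f(\mathbf{x}) + \varepsilon}} + \rho\, (x_i - x_i^*)^{1/3}\, H[\hat f(\mathbf{x})] \qquad (5)$$

The dynamical system (5) has a highly parallel structure and its initial conditions,
operation, and stopping criterion are highly analogous to the one-dimensional case.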

In the multidimensional case, $\mathbf{x}^*$ is initially chosen to be one corner of the
hyperparallelepiped $D$, usually $x_i^* = x_{iL}\ \forall i$. A repeller is placed at $\mathbf{x}^*$. It should be noted that
the repelling terms in the multidimensional case can be interpreted as "line" or
"hyperplane" repellers, and are active whenever $\hat f(\mathbf{x}) > 0$. It should also be noted that in the
multidimensional tunneling phase the sub-energy gradient is not identically zero. While
the repeller terms dominate, the amount of information in the sub-energy gradient helps
the state to flow in a direction that ultimately leads to the next lower valley.

In a departure from this paradigm, described in detail in [11], and in order to
formally guarantee the convergence to a global minimum in the multidimensional case, we
consider here an approach whereby the $M$-dimensional learning error function (typically
$M > N^2$, $N$ being the number of neurons in the network) is represented in terms of a
single variable. To achieve such a representation, Kolmogorov embeddings [13] could be
envisioned, but they result in a computationally complex nonsmooth transform. On the
other hand, multidimensional embeddings based on the dense covering of the real plane by
an Archimedes' spiral are well known [14]. Using the latter transformation, we can write

$$f[w_1, \cdots, w_i, \cdots, w_M] = f[a_1(\omega), \cdots, a_i(\omega), \cdots, a_M(\omega)] \qquad (6)$$

Each synaptic weight $w_i$ has been expressed in terms of a single parameter $\omega$, using the
Archimedes spiral representation. Specifically, for a 2-dimensional case, we can write

$$w_1 = \alpha\, \omega \cos \omega, \qquad w_2 = \alpha\, \omega \sin \omega \qquad (7)$$

where $\alpha$ is a constant controlling the precision of the computation. TRUST, the fastest
known [11] 1-D global optimization algorithm, can now readily be applied to the function
$f(\omega)$.
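
A minimal Python sketch of the two-dimensional spiral parametrisation is given below. It assumes Eq. (7) reads $w_1 = \alpha\omega\cos\omega$, $w_2 = \alpha\omega\sin\omega$; the helper names `spiral_weights` and `along_spiral` are introduced here and are not taken from the paper.

```python
import math


def spiral_weights(omega: float, alpha: float):
    """Archimedes spiral: (w1, w2) = (alpha*omega*cos(omega), alpha*omega*sin(omega))."""
    return (alpha * omega * math.cos(omega), alpha * omega * math.sin(omega))


def along_spiral(f, alpha: float):
    """Turn a 2-D objective f(w1, w2) into a 1-D function of the spiral parameter."""
    return lambda omega: f(*spiral_weights(omega, alpha))


alpha = 0.01
# Two points one full turn apart lie on the same ray from the origin,
# separated by the arm spacing 2*pi*alpha.
w_a = spiral_weights(3.0, alpha)
w_b = spiral_weights(3.0 + 2.0 * math.pi, alpha)
gap = math.hypot(w_b[0] - w_a[0], w_b[1] - w_a[1])
print(gap, 2.0 * math.pi * alpha)

# Example: restrict a simple quadratic bowl to the spiral
bowl_1d = along_spiral(lambda w1, w2: w1 * w1 + w2 * w2, alpha)
print(bowl_1d(3.0))
```

Successive arms of the spiral are $2\pi\alpha$ apart, so a smaller $\alpha$ makes the curve pass closer to every point of the plane at the price of a longer one-dimensional search range; this is one way to read the statement that $\alpha$ controls the precision of the computation.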

Learning  Examples 

To illustrate the power of this novel approach, we seek the global minima of the
following two functions:

$$f(w_1, w_2) = \left\{ \sum_{i=1}^{5} i \cos[(i+1)\, w_1 + i] \right\} \left\{ \sum_{j=1}^{5} j \cos[(j+1)\, w_2 + j] \right\} \qquad (8)$$

and

$$f(w_1, w_2) = \left[ 4 - 2.1\, w_1^2 + \frac{w_1^4}{3} \right] w_1^2 + w_1 w_2 + 4\, (w_2^2 - 1)\, w_2^2 \qquad (9)$$

The first function is displayed in Fig. 2, and is shown to exhibit a very complex structure.
For weights in the domain [-10, +10] x [-10, +10], it possesses 760 local minima. We
measure the cost of the global minimization in terms of the number of function evaluations.
Then, under identical operating conditions, we find that the costs of the best
stochastic method (Aluffi-Pentini [15]), Newton-tunneling (Levy-Montalvo [16]), and
interval algorithm (Walster [17]) were 241215, 12160 and 7424, respectively. TRUST, on the
other hand, reached the global minimum after only 269 function evaluations. The second
benchmark, the well known two-dimensional six-hump camelback function, possesses six
local minima and two global minima in the domain [-3, +3] x [-2, +2]. Starting at point
(-3, -2) it required 10,822 evaluations using the fastest stochastic methods, 1496 evaluations
using Newton-tunneling and 168 evaluations via TRUST.
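
For readers who wish to experiment with these benchmarks, the two test functions can be written in Python as below. This sketch assumes that Eq. (8) is the standard Shubert function with both sums running from 1 to 5 and that Eq. (9) is the standard six-hump camelback function; the function names are chosen here for illustration. The evaluation point (0.0898, -0.7126) is the approximate location of one camelback global minimum as commonly quoted in the optimization literature, not a value taken from this paper.

```python
import math


def shubert_factor(w: float) -> float:
    """One-dimensional factor sum_{i=1..5} i * cos((i + 1) * w + i)."""
    return sum(i * math.cos((i + 1) * w + i) for i in range(1, 6))


def shubert(w1: float, w2: float) -> float:
    """Product form assumed for Eq. (8)."""
    return shubert_factor(w1) * shubert_factor(w2)


def camelback(w1: float, w2: float) -> float:
    """Six-hump camelback function assumed for Eq. (9)."""
    return ((4.0 - 2.1 * w1 ** 2 + w1 ** 4 / 3.0) * w1 ** 2
            + w1 * w2
            + 4.0 * (w2 ** 2 - 1.0) * w2 ** 2)


# The camelback function is symmetric under (w1, w2) -> (-w1, -w2),
# which is why its global minima come as a pair.
print(camelback(0.0898, -0.7126))
print(camelback(-0.0898, 0.7126))

# Value of the Shubert function at the starting corner of its domain
print(shubert(-10.0, -10.0))
```

Because the Shubert function is a product of two identical one-dimensional factors, its extreme values over the square domain can be found from the extremes of `shubert_factor` alone, which is a convenient check on any global optimizer applied to it.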

In the present article we present two learning rules for symmetric diffusion
networks (SDN), a family of networks related to the Boltzmann machines but
that work in continuous time with continuous activations. We focus our analysis
primarily on a property of SDN's and Boltzmann machines that arises jointly
from their stochastic character and the presence of symmetric connections.
This is the fact that they are capable of using covariance information, rather
than explicit error information, to drive learning. In essence, symmetric
connectivity allows covariances between quantities computed deep inside a network
and quantities computed at the output to provide information on how to
improve performance. In the following sections we briefly describe SDN's, followed
by the derivation of the learning rules and simulation experiments.

1  THE  DIFFUSION  EQUATION 

SDN's can be seen as a variant of the continuous Hopfield model (1984) but with
hidden units and a stochastic component. Let $a = [a_1, \ldots, a_n]$ be a real valued
activation row vector. Let $W = [w_1^T, \ldots, w_n^T]$ be a real valued symmetric matrix
of connections, where each $w_i^T$ is the column vector of connections to the $i^{th}$
unit. The evolution of the activations is governed by the following system of
stochastic differential equations:

$$da_i(t) = \lambda_i\, drift_i(t)\, dt + \sigma\, dB_i(t); \quad i \in \{1, \ldots, n\} \qquad (1)$$

where $\lambda_i$ is the processing rate of the $i^{th}$ unit, $\sigma$ is the diffusion constant, which
controls the flow of entropy throughout the network, and $dB_i(t)$ is a Brownian
motion differential (Soong, 1973).
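
A system of this kind can be simulated with the Euler-Maruyama scheme, sketched below in Python. This is an illustrative discretisation only: the function name `simulate_sdn`, the `drift` callback, and the linear drift used in the example are assumptions made here, and the actual SDN drift is not reproduced.

```python
import math
import random


def simulate_sdn(a0, drift, rates, sigma, dt, n_steps, seed=0):
    """Euler-Maruyama steps for da_i = rate_i * drift_i(a) dt + sigma dB_i.

    drift(a) is a hypothetical callback returning the drift vector for the
    current activations; rates holds the processing rate of each unit.
    """
    rng = random.Random(seed)
    a = list(a0)
    for _ in range(n_steps):
        d = drift(a)
        # Brownian increments are Gaussian with variance dt
        a = [ai + rates[i] * d[i] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
             for i, ai in enumerate(a)]
    return a


# Placeholder drift pulling every activation toward zero (not the SDN drift)
def linear_drift(a):
    return [-ai for ai in a]


# With sigma = 0 the scheme reduces to deterministic forward-Euler decay
print(simulate_sdn([1.0, -0.5], linear_drift, [1.0, 1.0], 0.0, 0.01, 100))
# With sigma > 0 the same units follow a noisy trajectory
print(simulate_sdn([1.0, -0.5], linear_drift, [1.0, 1.0], 0.2, 0.01, 100))
```

The noise term scales with the square root of the time step, which is what distinguishes this scheme from an ordinary Euler step with added noise of fixed size.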

Equation 1 is known as a Langevin description of a Markovian diffusion
process with a diffusion matrix $\sigma I$ and a drift vector $drift(a) = [drift_1(a), \ldots, drift_n(a)]$.
The drift vector is the exact gradient, times a negative sign, of a Hopfield style
Goodness function

$$G(a) = \frac{1}{2}\, a\, W a^T - \sum_{i=1}^{n} \frac{1}{g_i} \int_{rest}^{a_i} f^{-1}(x)\, dx \qquad (2)$$

where $rest = f(0)$. We call the integral the stress ($s_i$) of the $i^{th}$
unit. If we use a logit function scaled in the min-max interval, the stress has
the following form:

$$s_i = (a_i - min) \log(a_i - min) + (max - a_i) \log(max - a_i) \qquad (3)$$

A known result in Markovian diffusion theory is that processes defined by a
Langevin equation satisfy the forward Fokker-Planck diffusion equation. In the
SDN case, the diffusion equation assumes the following form:

$$\frac{\partial P(a; t \,|\, a_0; t_0)}{\partial t} = -\nabla \cdot \bigl( drift(a)\, P(a; t \,|\, a_0; t_0) \bigr) + \frac{\sigma^2}{2} \nabla^2 P(a; t \,|\, a_0; t_0) \qquad (4)$$

where $P(a; t \,|\, a_0; t_0)$ represents the probability density of the network being
in state $a$, at time $t$, given that it was in state $a_0$ at time $t_0 < t$. The symbol
$\nabla \cdot$ is the divergence operator

$$\nabla \cdot \bigl( drift(a)\, P(a; t \,|\, a_0; t_0) \bigr) = \sum_{i=1}^{n} \frac{\partial}{\partial a_i} \bigl( drift_i(a)\, P(a; t \,|\, a_0; t_0) \bigr) \qquad (5)$$

and $\nabla^2$ is the divergence of the gradient, also known as the Laplace operator,

$$\nabla^2 P(a; t \,|\, a_0; t_0) = \nabla \cdot \nabla P(a; t \,|\, a_0; t_0) = \sum_{i=1}^{n} \frac{\partial^2 P(a; t \,|\, a_0; t_0)}{\partial a_i^2} \qquad (6)$$

It is easy to show that the Boltzmann probability density function

/’(•)  =  (7) 

makes  the  left  side  of  equation  4  vanish  and  is  the  unique  limiting  distribution 
in  these  networks. 


2  COVARIANCE  LEARNING  RULES 

To begin, we partition the activation vector, $a \in \{A\}$, into an input vector,
$x \in \{X\}$, a vector of hidden unit activations, $h \in \{H\}$, and an output vector,
$y \in \{Y\}$, so that $a = [x, h, y]$. The input, hidden, and output sets may be
different for different patterns. The central problem is to find the gradient with
respect to the weight and gain parameters of an error function in the set of
output units when the set of input units is fixed to a particular vector $x$. Since
most of the results are common to both gains and weights, we proceed with our
derivations in terms of a generic parameter $\theta$, which can be a weight parameter
$w_{ij}$, or a gain parameter $g_i$. Let us define a random variable, $r$, which we will
name the goodness signal

$$r_{xhy} = \frac{\partial G(xhy)}{\partial \theta} \qquad (8)$$

From the definition of goodness in equation 2 it is easy to show that for $\theta = w_{ij}$

$$r_{xhy} = (a_i a_j)_{xhy} \qquad (9)$$

where $(a_i a_j)_{xhy}$ is the coproduct of the $i^{th}$ and $j^{th}$ elements in the $xhy$ vector.
For  9  — 


We  now  proceed  to  derive  the  learning  rules. 

2.1  Minimizing  Information  Gain 

In  this  case  the  derivations  are  similar  to  the  Boltzmann  machine  learning 
derivations (Ackley et al., 1985) but replacing sums with integrals. However,
since  SDNs  are  defined  in  continuous  activation  space,  we  can  also  derive  rules 
for  the  gain  parameters. 

We use as error function a continuous version of the total information gain
function (Ackley et al., 1985)

$$TIG_x(\theta) = \int P_x^d(y)\, \log \frac{P_x^d(y)}{P_x(y)}\, dy \tag{10}$$

where $P_x(y)$ represents the obtained equilibrium probability of output vector
$y$ when the input activations are fixed to the vector $x$, and $P_x^d(y)$ represents
the desired probability density of the output vector $y$ when the environment is
in input state $x$.

Following  steps  similar  to  those  used  in  the  Boltzmann  machine  derivations 
but  replacing  sums  with  integrals  it  can  be  shown  that 


$$\frac{\partial\, TIG_x(\theta)}{\partial \theta} = -\frac{2}{\sigma^2} \big( E_d(E_{xy}(\tau)) - E_x(\tau) \big) \tag{11}$$


where $E_d()$ is the expected value using the desired probability distribution of
output vectors; $E_{xy}(a_i a_j)$ is the expected value of the product of the
activations of the $i$-th and $j$-th units when the input units are fixed to pattern $x$,
the output units are fixed to pattern $y$ and the hidden units are free to evolve
according to equation 4; $E_x(a_i a_j)$ represents the expected value of this
product when the input units are fixed to pattern $x$ but the output and hidden
units are free; and $\epsilon$ is the step size for weight changes. As in the Boltzmann
machine, when the network runs with inputs fixed to $x$ and outputs fixed to
$y$, it is said to be in a fixed (plus) phase. When the inputs are fixed to $x$ but
the output units are not fixed, the network is said to be in a free (minus) phase.
The learning rule for weights is often called contrastive Hebbian learning
(Galland and Hinton, 1989; Movellan, 1990)

$$\Delta w_{ij} = \epsilon \big( E_d(E_{xy}(a_i a_j)) - E_x(a_i a_j) \big) \tag{12}$$
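
As a rough illustration of equation 12, the two expectations can be replaced by sample averages collected in the fixed (plus) and free (minus) phases. The sketch below assumes such samples are already available as lists of activation vectors; how they are drawn from the network is not shown. The function names `mean_coproducts` and `chl_update` are hypothetical and not taken from the paper.

```python
def mean_coproducts(samples):
    """Average the outer product a a^T over a list of activation vectors."""
    n = len(samples[0])
    acc = [[0.0] * n for _ in range(n)]
    for a in samples:
        for i in range(n):
            for j in range(n):
                acc[i][j] += a[i] * a[j]
    return [[v / len(samples) for v in row] for row in acc]


def chl_update(plus_samples, minus_samples, eps=0.1):
    """Contrastive Hebbian step: eps * (E_plus[a_i a_j] - E_minus[a_i a_j]).

    plus_samples:  activations recorded with inputs and outputs fixed
                   (assumed to cover the desired output distribution).
    minus_samples: activations recorded with only the inputs fixed.
    """
    plus = mean_coproducts(plus_samples)
    minus = mean_coproducts(minus_samples)
    n = len(plus)
    return [[eps * (plus[i][j] - minus[i][j]) for j in range(n)]
            for i in range(n)]
```

The returned matrix is symmetric by construction, which matches the symmetric weights of these networks.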

2.2  Minimizing  the  Error  of  Average  Activations 

This  rule  is  the  SDN  equivalent  of  the  generalized  delta  rule  (Werbos,  1974; 
LeCun, 1985; Rumelhart et al., 1986) in back propagation networks, based on
minimization  of  the  total  sum  of  squares  (TSS).  This  is  an  appropriate  error 
function  for  deterministic  problems  or  for  deterministic  problems  with  indepen¬ 
dent  noise  contamination,  where  we  just  need  to  learn  average  values  of  the 
output  units.  Since  the  error  of  each  output  unit  is  combined  in  an  additive 
manner,  we  do  not  lose  generality  by  focusing  on  the  case  where  there  is  a  unique 
output  unit 

$$TSS(\theta) = \frac{1}{2}\, (E_x(y) - d)^2 \tag{13}$$

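where $d$ is a desired real value and $E_x(y)$ the expected value of the output unit
when the input units are fixed to $x$. Using the chain rule

$$\frac{\partial}{\partial \theta} TSS(\theta) = (E_x(y) - d)\, \frac{\partial}{\partial \theta} E_x(y) \tag{14}$$

It can be seen that

$$\frac{\partial}{\partial \theta} TSS(\theta) = \frac{2}{\sigma^2}\, Cov_x(\tau;\ y\, \delta_y) \tag{15}$$

where $\delta_y = (E_x(y) - d)$ and $Cov_x(\tau; y)$ is the covariance between the goodness
signal, $\tau$, and the activation of the output unit $y$ when the inputs are fixed
to $x$. For more than one output unit, the desired gradients can be computed
very efficiently using the following relationship

$$\frac{\partial}{\partial \theta} TSS(\theta) = \frac{2}{\sigma^2} \sum_y Cov_x(\tau;\ y\, \delta_y) = \frac{2}{\sigma^2}\, Cov_x\Big(\tau;\ \sum_y y\, \delta_y\Big) \tag{16}$$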

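The learning rules for gains and weights easily follow

$$\Delta w_{ij} = -\epsilon\, \frac{2}{\sigma^2}\, Cov_x\Big(a_i a_j;\ \sum_y y\, \delta_y\Big) \tag{17}$$

$$\Delta g_k = -\epsilon\, \frac{2}{\sigma^2}\, Cov_x\Big(\frac{\partial G}{\partial g_k};\ \sum_y y\, \delta_y\Big) \tag{18}$$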
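
A minimal sketch of the covariance rule for a single weight and a single output unit, assuming free-phase samples of the activation vector are already available. The constant in front of the covariance is passed in as `scale`, which is an assumption of this sketch, as are the function names and the convention that `out` indexes the output unit.

```python
def covariance(u, v):
    """Sample covariance of two equally long sequences (divides by n)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n


def mea_weight_update(samples, i, j, out, d, eps=0.1, scale=1.0):
    """Weight step -eps * scale * Cov_x(a_i a_j; y * delta_y).

    samples: activation vectors recorded with only the inputs fixed.
    out:     index of the output unit y inside each activation vector.
    d:       desired average value of the output unit.
    """
    y = [a[out] for a in samples]
    delta_y = sum(y) / len(y) - d          # E_x(y) - d
    tau = [a[i] * a[j] for a in samples]   # goodness signal for w_ij
    weighted = [yk * delta_y for yk in y]
    return -eps * scale * covariance(tau, weighted)
```

Only one phase is needed here, which is the point made in the simulations about comparing the two rules in terms of learning phases.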
2.3 Simulations


We tested the learning rules on a simple exclusive-or problem. Table 1 shows
the median number of epochs to criterion and, in parentheses, the percentage
of simulations where criterion was achieved in less than 4000 epochs. These
statistics were obtained with 20 runs per cell with random starting weights
chosen from a uniform (-1, 1) distribution.

Table 1 shows that CHL was the fastest algorithm in terms of number of epochs,
followed by the MEA rule. However, since the CHL rule requires two learning
phases, the MEA algorithm had comparable speed in terms of learning phases.
There was not a statistically significant difference between the average number
of phases, using optimal step sizes, for the CHL and for the MEA rule (t(38) =
0.19; p > 0.05).


Table 1

Learning rule    Step size
                 0.1            0.01           0.001
CHL              81 (90%)       72 (100%)      308.5 (100%)
MEA              98.5 (100%)    94 (100%)      785.5 (100%)


3  CONCLUSIONS 

The work presented here builds on previous work on stochastic networks (Ackley,
Hinton and Sejnowski, 1985; Geman and Geman, 1984; Smolensky, 1986) and
extends it to the continuous case. In particular we have explored the problem of
learning in a stochastic network that works in continuous time with continuous
activation states. We have presented learning algorithms based on computations
of simple covariances that are capable of learning problems that require
hidden  units.  One  of  the  advantages  of  noise  in  these  networks  is  that  it  pro¬ 
vides  enough  information  to  calculate  gradients  with  respect  to  weights  without 
explicit  back-propagation  of  error  through  a  network  of  inverted  weights  and 
activation  function  derivatives.  This  aspect  makes  the  learning  rules  attractive 
for  biologically  oriented  simulations  and  for  hardware  implementations. 
Network Reciprocity: A Simple Approach to Derive Gradient
Algorithms for Arbitrary Neural Network Structures.

Eric  A.  Wan*  and  Francoise  Beaufays^ 

Abstract 

Deriving  backpropagation  algorithms  for  time-dependent  neural  network  structures  typically 
requires  numerous  chain  rule  expansions,  diligent  bookkeeping,  and  careful  manipulation  of  terms. 
In  this  paper,  we  show  how  to  use  the  principle  of  Network  Reciprocity  to  derive  such  algorithms 
via  a  set  of  block  diagram  manipulation  rules.  Examples  are  provided  that  illustrate  the  simplicity 
of  the  approach.  Algorithms  are  derived  for  a  variety  of  structures,  including  feedforward  and 
feedback  systems. 

Network  Adaptation  and  Error  Gradient  Propagation 

Adapting  a  feedforward  multilayer  neural  network  structure  amounts  to  finding  the  set  of  vari¬ 
able  weights  W  that  minimizes  the  cost  function: 

$$J = \sum_{k=1}^{K} e^T(k)\, e(k), \tag{1}$$

where the sum is taken over $K$ samples in a training sequence, and $e(k)$ is the error vector.

In certain problems (e.g., time series prediction, system identification), a desired output is
specified at each time step; in others (e.g., terminal control), the desired output is defined only at
final time $k = K$. Therefore, we define the error vector $e(k)$ as the difference between the desired
and the actual output vectors when a desired output is available, and as zero otherwise.

According to gradient descent, the contribution to the weight update at each time step is

$$\Delta W(k) = -\mu\, \frac{\partial J}{\partial W(k)}, \tag{2}$$

where $\mu$ controls the learning rate. Note we evaluate $\partial J / \partial W(k)$ rather than the instantaneous
gradient $\partial (e^T(k) e(k)) / \partial W(k)$. This is essential for the desired Network Reciprocity result.

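At the architectural level, a variable weight$^1$ $w_{ij}$ may be isolated between two points in a network
with corresponding signals $a_i(k)$ and $a_j(k)$ (i.e., $a_j(k) = w_{ij}\, a_i(k)$). Using the chain rule, we get

$$\frac{\partial J}{\partial w_{ij}(k)} = \frac{\partial J}{\partial a_j(k)}\, \frac{\partial a_j(k)}{\partial w_{ij}(k)} = \frac{\partial J}{\partial a_j(k)}\, a_i(k), \tag{3}$$

and the weight update becomes

$$\Delta w_{ij}(k) = -\mu\, \delta_j(k)\, a_i(k), \tag{4}$$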

*Dept.  of  Electrical  Engineering  and  Applied  Physics,  Oregon  Graduate  Institute  of  Science  and  Technology, 
P.O.box  91000,  Portland,  OR  97291. 

^Dept.  of  Electrical  Engineering,  Stanford  University,  Stanford,  CA  94305-4055.  This  work  was  sponsored  by 
EPRI  under  contract  RP8010-13  and  NSF  under  grant  NSF  IRI  91-12531. 

$^1$The general case of a variable coefficient, $a_j(k) = f(w_{ij}, a_i(k))$, is treated in [5,6].

where we define the error gradient

$$\delta_j(k) \triangleq \frac{\partial J}{\partial a_j(k)}. \tag{5}$$

The error gradient $\delta_j(k)$ depends on the entire topology of the network. Reported methods
for deriving the delta terms rest on chain rule expansions, which must be carefully applied for the
specific network topology. In the next section, the method of Network Reciprocity is introduced as a
means for finding the delta terms without algebraic derivation. A set of simple block diagram rules
is used to construct a reciprocal network, which then directly specifies the adaptive algorithm.


Construction  of  a  Reciprocal  Network 

An  arbitrary  multilayer  network  can  be  represented  as  a  block  diagram  whose  building  blocks 
are:  summing  junctions,  branching  points,  univariate  functions,  multivariate  functions,  and  delay 
operators. 

The reciprocal network is constructed by reversing the flow direction in the original network,
labeling all resulting signals $\delta_i(k)$, and performing the following operations.


1.  Summing  junctions  are  replaced  with  branching  points. 


2.  Branching  points  are  replaced  with  summing  junctions. 


3.  Univariate  functions  are  replaced  with  their  derivatives. 


Explicitly, $\delta_i(k) = f'(a_i(k))\, \delta_j(k)$, where $f'(a_i(k)) = \partial a_j(k) / \partial a_i(k)$. We have included the
time index $k$ to emphasize the linear time-dependent transmittance. Special cases are:

• Weights: $a_j = w_{ij}\, a_i$, in which case $\delta_i = w_{ij}\, \delta_j$.

• Activation function: $a_n(k) = \tanh(a_j(k))$. In this case, $f'(a_j(k)) = 1 - a_n^2(k)$.


4. Multivariate functions are replaced with their Jacobians.

Explicitly, $\delta_{in}(k) = F'(a_{in}(k))\, \delta_{out}(k)$, where $F'(a_{in}(k)) = \partial a_{out}(k) / \partial a_{in}(k)$ corresponds to a
matrix of partial derivatives. For shorthand, $F'(a_{in}(k))$ will be written simply as $F'(k)$. Note
this rule replaces a nonlinear function by a linear time-dependent transmittance. Important
cases include:

• Product junctions, $a_j(k) = a_i(k)\, a_l(k)$, in which case $F' = [a_l(k)\ \ a_i(k)]^T$.

• Layered networks, in which case the product $F'(k)\, \delta_{out}(k)$ is found directly by
backpropagating $\delta_{out}$ through the network.

5.  Delay  operators  are  replaced  with  advance  operators. 

Explicitly, $a_j(k) = q^{-1} a_i(k) = a_i(k-1)$ is transformed into $\delta_i(k) = q^{+1} \delta_j(k) = \delta_j(k+1)$. This
forces the reciprocal system to be noncausal and is crucial for maintaining the topological
equivalence between the original network and its reciprocal. Making the system causal is an
implementation issue that must be addressed for each specific example.

These 5 rules allow direct construction of the reciprocal network from the original network.
Note there is a topological equivalence between the two networks. By reversing the signal flow,
output nodes in the original network become input nodes in the reciprocal network. These inputs
are then set at each time step to $-2e(k)$. The signals $\delta_j(k)$ that propagate through the reciprocal
network correspond to the terms $\partial J / \partial a_j(k)$ necessary for gradient adaptation. The exact equations
are "read out" directly from the reciprocal network. A formal proof that this always provides the
correct derivation may be found in [5,6].
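
The rules can be tried on a very small chain, sketched here under the assumption of a single weight followed by a tanh unit and a scalar squared error. The reciprocal pass starts from $-2e$ at the output, applies rule 3 for the tanh and then rule 3 for the weight, and the result is compared with a finite-difference estimate. All names below are illustrative only.

```python
import math


def forward(w, a0):
    """Original network: a1 = w * a0, a2 = tanh(a1)."""
    a1 = w * a0
    a2 = math.tanh(a1)
    return a1, a2


def cost(w, a0, d):
    """Squared error J = e^2 with e = d - a2."""
    return (d - forward(w, a0)[1]) ** 2


def reciprocal_deltas(w, a0, d):
    """Run the reciprocal network from the output back to the input."""
    _, a2 = forward(w, a0)
    delta2 = -2.0 * (d - a2)            # input of the reciprocal network
    delta1 = (1.0 - a2 * a2) * delta2   # rule 3: tanh replaced by its derivative
    delta0 = w * delta1                 # rule 3: weight transmits the delta
    return delta0, delta1, delta2


def weight_gradient(w, a0, d):
    """dJ/dw = delta at the weight output times the weight input (equation 3)."""
    _, delta1, _ = reciprocal_deltas(w, a0, d)
    return delta1 * a0
```

The weight update of equation 4 is then `-mu * weight_gradient(w, a0, d)` for a chosen learning rate `mu`.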
Examples

Backpropagation

We start by rederiving standard backpropagation [3] using the principles of Network Reciprocity.
Figure 1 shows a hidden neuron feeding other neurons and an output neuron in a multilayer network.
For consistency with traditional notation, we have labeled the summing junction signal $s_i^l$ rather
than $a_i$, and added superscripts to denote the layer. In addition, since the multilayer networks are
static structures, we omit the time index $k$.


Figure  1:  Block  diagram  construction  of  a  multilayer  network. 




Figure  2:  Reciprocal  multilayer  network. 


The reciprocal network shown in Figure 2 is found by applying the construction rules of the
previous section. From this figure, we may immediately write down the equations for calculating
the delta terms:

$$\delta_i^l = \begin{cases} -2 e_i\, f'(s_i^L), & l = L, \\ f'(s_i^l) \sum_j \delta_j^{l+1}\, w_{ij}^{l+1}, & 0 \le l \le L-1. \end{cases} \tag{6}$$

By Equation 4 the weight update is given by

$$\Delta w_{ij}^l = -\mu\, \delta_j^l\, a_i^{l-1}. \tag{7}$$

These are precisely the equations describing standard backpropagation. It should be emphasized
that this approach provided a formal derivation requiring no chain rule expansions.

Cascaded  Neural  Networks 

Let  us  now  turn  to  the  more  complicated  example  of  two  cascaded  neural  networks  as  illustrated 
in Figure 3. The inputs to the first network are samples from a time sequence $x(k)$. Delayed outputs
of  the  first  network  are  fed  to  the  second  network.  Typically,  the  last  network  represents  the  model 
of  some  physical  system,  and  the  first  network  is  used  to  prewarp  or  equalize  the  driving  signal. 


Figure  3:  Cascaded  neural  network  filters. 


The  cascaded  networks  are  defined  as 

$$u(k) = \mathcal{N}_1(W_1, x(k), x(k-1), x(k-2)), \tag{8}$$

$$y(k) = \mathcal{N}_2(W_2, u(k), u(k-1), u(k-2)), \tag{9}$$


where $W_1$ and $W_2$ represent the weights parameterizing the networks, $x(k)$ is the input, $y(k)$ the
output,  and  u(k)  the  intermediate  signal.  Given  a  desired  response  for  the  output  y  of  the  second 
network,  it  is  a  straightforward  procedure  to  use  backpropagation  for  adapting  the  second  network. 
It  is  not  obvious,  however,  what  the  effective  error  should  be  for  the  first  network.  In  this  case,  the 
chain  rule  is  simple  enough  to  apply  directly  to  find  the  instantaneous  error  gradient: 


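$$\frac{\partial e^2(k)}{\partial W_1} = -2e(k)\, \frac{\partial y(k)}{\partial W_1}$$

$$= -2e(k) \left[ \frac{\partial y(k)}{\partial u(k)} \frac{\partial u(k)}{\partial W_1} + \frac{\partial y(k)}{\partial u(k-1)} \frac{\partial u(k-1)}{\partial W_1} + \frac{\partial y(k)}{\partial u(k-2)} \frac{\partial u(k-2)}{\partial W_1} \right] \tag{10}$$

$$= \delta_1(k)\, \frac{\partial u(k)}{\partial W_1} + \delta_2(k)\, \frac{\partial u(k-1)}{\partial W_1} + \delta_3(k)\, \frac{\partial u(k-2)}{\partial W_1}, \tag{11}$$

where we define

$$\delta_{i+1}(k) \triangleq -2e(k)\, \frac{\partial y(k)}{\partial u(k-i)}, \qquad i = 0, 1, 2.$$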

The $\delta_{i+1}(k)$ terms are found simultaneously by a single backpropagation of the error through the second
network. Each product $\delta_{i+1}(k)\, \partial u(k-i) / \partial W_1$ is then found by backpropagation applied to the first
network with $\delta_{i+1}(k)$ acting as an error. However, since the derivatives used in backpropagation
are time-dependent, separate backpropagations are necessary for each $\delta_{i+1}(k)$. These equations, in
fact, imply backpropagation through an unfolded structure as illustrated in Figure 4. In situations
where there may be hundreds of taps in the second network, this approach leads to a very inefficient
adaptation algorithm.


Figure  4:  Cascaded  neural  network  filters  unfolded-in-time. 


A more efficient algorithm for finding the delta terms may be arrived at by returning to the
method of Network Reciprocity. The original cascaded networks are transformed into the reciprocal
structure shown in Figure 5. Simply by labeling the desired signals, gradient relations may be
written down directly:

$$\delta_u(k) = \delta_1(k) + \delta_2(k+1) + \delta_3(k+2), \tag{12}$$

with

$$[\delta_1(k)\ \ \delta_2(k)\ \ \delta_3(k)] = -2e(k)\, \mathcal{N}_2'(u(k)), \tag{13}$$

i.e., each $\delta_i(k)$ is found by backpropagation through the output network, and the $\delta_i$'s (after
appropriate advance operations) are summed together. The weight update is given by

$$\Delta W_1(k) = -\mu\, \delta_u(k)\, \frac{\partial u(k)}{\partial W_1(k)}, \tag{14}$$

in which the product term is found by a single backpropagation with $\delta_u(k)$ acting as the error to
the first network. The equations can be made causal by simply delaying the weight update for a few
time steps. Clearly, generalization to an arbitrary number of taps is also straightforward. This new
algorithm is far more efficient than the earlier direct gradient calculation method; we completely
avoided backpropagation through a redundant unfolded network.
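
The summed-delta idea can be checked numerically on a deliberately simplified cascade. The sketch below assumes a first network with a single weight and a single tap, $u(k) = \tanh(w_1 x(k))$, and a fixed linear three-tap second network; both simplifications, the tap values in `C`, and the function names are assumptions made for illustration. The gradient built from $\delta_u(k)$ is compared against finite differences of the total cost.

```python
import math

C = (0.5, -0.3, 0.2)  # hypothetical fixed taps of the second (linear) network


def run(w1, x):
    """u(k) = tanh(w1 * x(k)); y(k) = sum_i C[i] * u(k - i), zero initial state."""
    u = [math.tanh(w1 * xk) for xk in x]
    y = [sum(C[i] * u[k - i] for i in range(3) if k - i >= 0)
         for k in range(len(x))]
    return u, y


def cost(w1, x, d):
    """J = sum_k e(k)^2 with e(k) = d(k) - y(k)."""
    _, y = run(w1, x)
    return sum((dk - yk) ** 2 for dk, yk in zip(d, y))


def grad_reciprocal(w1, x, d):
    """dJ/dw1 using delta_u(k) = delta_1(k) + delta_2(k+1) + delta_3(k+2)."""
    u, y = run(w1, x)
    K = len(x)
    e = [dk - yk for dk, yk in zip(d, y)]
    grad = 0.0
    for k in range(K):
        # delta_{i+1}(k+i) = -2 e(k+i) * dy(k+i)/du(k); terms past the end are zero.
        delta_u = sum(-2.0 * e[k + i] * C[i] for i in range(3) if k + i < K)
        # One backpropagation through the first network per time step.
        grad += delta_u * (1.0 - u[k] ** 2) * x[k]
    return grad
```

Only one pass through the first network is needed per time step, regardless of the number of taps in the second network.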
Backpropagation-Through-Time

So far, we have examined only feedforward structures. Figure 6 illustrates a recurrent network
described by

$$y(k) = \mathcal{N}(x(k), y(k-1)), \tag{15}$$

where $x(k)$ are external inputs, and $y(k)$ represents the vector of outputs that form feedback
connections. $\mathcal{N}$ is a multilayer neural network. If $\mathcal{N}$ has only one layer of neurons, every neuron
output has a feedback connection to the input of every other neuron and the structure is referred
to as a fully recurrent network [9]. Typically, only a select set of the outputs have an actual desired
response.  The  remaining  outputs  have  no  desired  response  (error  equals  zero)  and  are  used  for 
internal  computation. 


Figure  6:  Recurrent  network  and  backpropagation-through-time. 

Calculating  the  gradients  for  such  a  structure  can  be  extremely  complicated.  A  weight  pertur¬ 
bation  at  a  specified  time  step  affects  not  only  the  output  at  future  time  steps,  but  future  inputs 
as  well.  However,  applying  the  Network  Reciprocity  rules  (see  Figure  6)  we  find  immediately: 

$$\delta(k) = \delta_p(k) - 2e(k)$$

$$= \mathcal{N}'(k+1)\, \delta(k+1) - 2e(k). \tag{16}$$

Note  the  causality  constraints  require  these  equations  to  be  run  backward  in  time.  These  are 
precisely  the  equations  describing  backpropagation-through-time,  which  have  been  derived  in  the 
past using either ordered derivatives [8] or Euler-Lagrange techniques [2]. Network Reciprocity is
by  far  the  simplest  and  most  direct  approach. 
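
A scalar sketch of the backward recursion, assuming a one-neuron recurrent network $y(k) = \tanh(w\, x(k) + v\, y(k-1))$ with zero initial state and cost $J = \sum_k e^2(k)$. The network form and all names are assumptions chosen to keep the example short; the recursion itself is the one in equation 16, run backward in time.

```python
import math


def forward(w, v, x):
    """y(k) = tanh(w * x(k) + v * y(k-1)) with y(-1) = 0."""
    y, prev = [], 0.0
    for xk in x:
        prev = math.tanh(w * xk + v * prev)
        y.append(prev)
    return y


def cost(w, v, x, d):
    return sum((dk - yk) ** 2 for dk, yk in zip(d, forward(w, v, x)))


def bptt(w, v, x, d):
    """Gradients (dJ/dw, dJ/dv) from delta(k) = N'(k+1) delta(k+1) - 2 e(k)."""
    y = forward(w, v, x)
    K = len(x)
    grad_w = grad_v = 0.0
    delta_next = 0.0  # no contribution from beyond the final time step
    for k in reversed(range(K)):
        # N'(k+1) = dy(k+1)/dy(k) for the scalar network above.
        jac_next = (1.0 - y[k + 1] ** 2) * v if k + 1 < K else 0.0
        delta = jac_next * delta_next - 2.0 * (d[k] - y[k])
        local = 1.0 - y[k] ** 2
        y_prev = y[k - 1] if k > 0 else 0.0
        grad_w += delta * local * x[k]
        grad_v += delta * local * y_prev
        delta_next = delta
    return grad_w, grad_v
```

The loop over `reversed(range(K))` is the noncausal part: the whole sequence is stored first and the deltas are then computed from the last time step to the first.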


Other  examples 

Backpropagation-through-time  has  been  modified  for  a  variety  of  neural  control  problems  [1]. 
Suppose  the  state-space  model  of  a  dynamic  system  is  given  and  a  neural  controller  is  to  be  built 
to  drive  the  plant.  The  overall  structure  is  related  to  the  recurrent  network  seen  in  the  previous 
section, in which $y(k)$ is the plant state vector, and $x(k)$ is the output of an additional neural
controller taking as inputs past states $y(k-1)$. Again, Network Reciprocity provides a direct
derivation  of  the  adaptation  algorithm. 

A  more  general  case  is  obtained  by  passing  the  state  vector  and  the  driving  signal  through 
tapped-delay  lines  before  entering  the  controller  and/or  plant  model.  Deriving  the  adaptation 
algorithms  for  the  resulting  ARMA  (AutoRegressive  Moving  Average)  networks  would  be  extremely 
difficult  without  using  Network  Reciprocity. 

Related to cascaded networks are structures that distribute time delays through the entire
network. Such architectures include FIR neural networks [6,7] (where the synaptic connections of
the traditional multilayer neural network are replaced by FIR (Finite Impulse Response) filters),
time-delay neural networks [4] (where time-delays are introduced between the hidden layers of a
feedforward neural network), IIR (Infinite Impulse Response) structures, and lattice filters. For such
networks, direct chain rule expansions or equivalent unfolded structures are extremely complicated.
In all cases, Network Reciprocity provides a quick and easy way to derive the desired adaptation
algorithm.
Abstract

A new algorithm has been proposed which uses cooperative efforts of several identical neural
networks for efficient gradient descent learning. In contrast to the sequential gradient descent,
in this algorithm it is easy to select learning rates such that the number of epochs for convergence
is minimized. This algorithm is suitable for implementation on a parallel or distributed
environment. It has been implemented on a network of heterogeneous workstations using p4. Results
are presented where a few learners cooperate and learn much faster than if they learn individually.


1  Introduction 

The goal of supervised learning from examples is generalization using some preclassified inputs
(training set). Learning in neural networks is achieved by adjusting the connection strengths
(weights) among processors, so that the outputs reflect the class of the input patterns. One
popular method of adjusting the weights is gradient descent learning through back-propagation [8].
Unfortunately, in the back-propagation algorithm, a number of parameters have to be appropriately
specified. If parameters are not appropriate, the algorithm can take a long time to converge
or may not converge at all [7]. Due to the local minimum problem, an appropriate learning rate
significantly affects the quality of the generalization and the number of epochs for convergence [2].
Selection  of  an  appropriate  learning  rate  is  a  computationally  expensive  experimental  problem  that 
can  be  solved  satisfactorily  for  small  networks  only  [5]. 

The goal of this paper is to speed up learning with improved accuracy using systems composed of
several neural networks of the same topology that concurrently run the standard back-propagation
algorithm. Our approach is different from the approach in [6] where each network learns a subset
of training examples. In our system, the various networks periodically communicate with each
other and cooperate in learning the entire training set. If any of the processes gets stuck in a local
minimum state, the rest of the processes help in moving it out of this predicament. The algorithm
also works well if any process gets stuck in a plateau or a ridge.

In  Section  2,  we  propose  this  new  cooperative  learning  algorithm,  followed  by  experimental 
results  in  Section  3  and  analysis  in  Section  4. 

2  Cooperative  Learning  Algorithm 

In  our  algorithm,  several  processes  run  the  standard  back-propagation  algorithm  concurrently.  All 
processes  work  on  neural  networks  of  identical  topology,  each  using  a  local  copy  of  the  training  set. 
These  processes  are  called  the  slave  processes.  A  master  process  initiates  these  slaves  and  controls 
them.  The  slaves  communicate  only  with  the  master.  The  master  initializes  its  hypothesis  (the 
weights  and  the  bias  values  of  the  neurons)  and  broadcasts  it  to  the  slaves.  The  slaves  adjust  this 
hypothesis  using  back-propagation.  Each  slave  uses  its  own  learning  rate  that  is  different  from  the 
learning  rates  of  other  slaves  and  hence,  the  adjusted  hypothesis  in  each  of  the  slaves  is  different. 

*Research sponsored in part by the NSF research grant NSF-IRI-9308523.

Figure 1: Communication Network Topology for Cooperative Learning Algorithm


Periodically,  the  slaves  cooperate  by  exchanging  information.  The  period  between  two  cooperations 
is  called  an  era. 

The  algorithm  is  suitable  for  implementation  on  a  distributed  platform  since  the  communication 
graph  is  simple  and  the  total  number  of  communications  is  small  (Figure  1).  Our  implementation 
uses  p4  which  supports  parallel  programming  for  both  distributed  environments  and  highly  parallel 
computers  [1].  It  helps  to  create  the  master  and  the  slave  processes  and  provides  easy  means  of 
communication  between  them.  Another  advantage  of  using  p4  for  neural  networks  implementation 
is  its  ability  to  port  directly  from  a  distributed  to  a  highly  parallel  platform  [4]. 

2.1 Epoch-based Cooperation

In epoch-based cooperation, the slaves communicate their learned weights back to the master after a
specified number of epochs (one era). Since all slaves use the identical topology, the master forms
a new hypothesis after each era by averaging these weights. For each link between neurons, the
new weight is the average of the weights of that link as computed by the slaves. This hypothesis is
broadcast to the slaves and they proceed with back-propagation for the next era starting from this
new hypothesis. When any of the slaves has learned the training set to satisfaction, the hypothesis
learned by this slave is output and learning is completed.
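
The era loop can be sketched as follows. This is a sequential simulation only: it does not use p4, the slaves are run one after another inside a single process, and plain gradient descent on a caller-supplied `grad` function stands in for back-propagation. The function names and the stopping test `loss(w) < tol` are assumptions of the sketch.

```python
def average_hypotheses(hypotheses):
    """Master step: average each weight over the hypotheses of all slaves."""
    n = len(hypotheses)
    return [sum(ws) / n for ws in zip(*hypotheses)]


def cooperative_learning(init, rates, grad, loss, epochs_per_era, tol, max_eras):
    """Epoch-based cooperation with one learning rate per simulated slave.

    Returns the final hypothesis and the number of eras that were run.
    """
    hypothesis = list(init)
    for era in range(1, max_eras + 1):
        learned = []
        for eta in rates:                      # each slave starts from the broadcast
            w = list(hypothesis)
            for _ in range(epochs_per_era):    # one era of gradient descent
                g = grad(w)
                w = [wi - eta * gi for wi, gi in zip(w, g)]
            if loss(w) < tol:                  # this slave is satisfied: stop
                return w, era
            learned.append(w)
        hypothesis = average_hypotheses(learned)
    return hypothesis, max_eras
```

For example, with a one-weight quadratic loss, `cooperative_learning([0.0], (0.05, 0.2), grad, loss, 5, 1e-6, 100)` runs two simulated slaves with learning rates 0.05 and 0.2 and five epochs per era.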

2.2  Time-based  Cooperation 

One disadvantage of the epoch-based cooperation is that the slaves on faster machines finish their
era earlier but they have to wait for the slowest slave to finish its era. So, in a heterogeneous
environment, the slowest machine is a bottleneck and one cannot take advantage of faster
machines. For such heterogeneous environments, we propose another approach called the time-based
cooperation. Here, the era is specified as a duration of time rather than a number of epochs. Since
all slaves run for the same duration, no machine will be idle.

2.3  Cooperation  with  Dynamic  Learning  Rates 

In this approach, we start with the cooperative algorithm (epoch or time based) using initial
learning rates spread uniformly in the (0,1) range. After a few eras, the range of the learning rates is
reduced. New values for the learning rates are chosen uniformly around the value of the learning
rate of the slave which currently generalizes the best. The advantage of this approach is that the
selection of an optimal learning rate becomes completely automatic.

Figure 2: Epoch-based Cooperation for Pattern Classification Problem
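
One possible reading of the rate selection, sketched under stated assumptions: "spread uniformly" is taken to mean evenly spaced values strictly inside the interval, and the reduced range is a window of a chosen `width` centred on the best slave's rate and clipped to (0,1). The paper does not specify either detail, and the function names are hypothetical.

```python
def spread_rates(lo, hi, n):
    """Return n evenly spaced learning rates strictly inside (lo, hi)."""
    step = (hi - lo) / (n + 1)
    return [lo + step * (i + 1) for i in range(n)]


def narrowed_rates(best, width, n, lo=0.0, hi=1.0):
    """Return n rates spread over a window of the given width around best."""
    new_lo = max(lo, best - width / 2.0)   # keep the window inside (lo, hi)
    new_hi = min(hi, best + width / 2.0)
    return spread_rates(new_lo, new_hi, n)
```

With four slaves the initial rates are approximately 0.2, 0.4, 0.6 and 0.8; if the slave at 0.4 generalizes best, `narrowed_rates(0.4, 0.4, 3)` gives rates near 0.3, 0.4 and 0.5 for the next eras.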


3  Results 

Two  benchmark  problems  are  used  for  experimentation.  The  experiments  are  performed  by  varying 
the  number  of  slaves  from  one  to  four.  Both  the  epoch-based  and  the  time-based  cooperation  are 
tested. 


3.1  Pattern  Classification  Problem 

The problem is to classify three patterns, ‘A’, T and ‘O’, formed in a 4-by-4 grid, using a feedforward
network. Figure 2 gives the results of the epoch-based cooperation for this problem. In this figure,
the One Node table gives the number of epochs required to learn the training set using the sequential
back-propagation algorithm with various learning rates. The number of epochs to learn the training
set using the cooperative system of two slaves with various pairs of learning rates is given in the Two
Nodes table. Here, one slave uses the learning rate $\eta_1$ and the other uses $\eta_2$. Similarly, other tables
show results for cooperative systems of three and four slaves respectively.


3.2  Two-Spirals  Problem 

This  hard  benchmark  problem  consists  of  two  classes  of  points  arranged  in  two  interlocking  spirals 
that  go  around  the  origin  [3].  The  goal  is  to  develop  a  feed  forward  network  that  classifies  all 
the  training  points  correctly.  The  results  of  the  epoch-based  cooperative  learning  algorithm  on  a 
training  set  of  40  points  are  given  in  Figure  3. 

Figure 4 gives results of the time-based cooperative algorithm. In one experiment, the
cooperative algorithm using two slaves is run on a homogeneous system consisting of two DEC3100
workstations. The slave on one of the workstations uses learning rate $\eta_{DEC1}$ while the other slave
on the other workstation uses $\eta_{DEC2}$. The number of cooperations required for convergence using
various pairs of learning rates is given in Figure 4 (a). In the other experiment two slaves are run
on a heterogeneous system consisting of the faster HP9000/735 and the slower DEC3100
workstation. In the pair of learning rates given in Figure 4 (b) the left value is used by the slave on the
DEC3100 and the right value by the slave on the HP9000/735.

Figure 3: Epoch-based Cooperation for Two-Spirals Problem

Figure 4: (a) Homogeneous and (b,c) Heterogeneous System for Cooperative Learning

Similar results are obtained for a training set of 80 points. Here, the range of good learning rates
for  the  sequential  algorithm  is  smaller  than  for  40  points. 

4  Analysis  of  Experimental  Results 

4.1  Epoch-based  experiments 

Let $\eta_{min}$ be the learning rate that minimizes the number of epochs for convergence in standard
back-propagation. From the experiments it can be observed that if the learning rates for the slaves
in the cooperative algorithm are chosen such that $\eta < \eta_{min}$ for some slaves, and $\eta > \eta_{min}$ for the
remaining slaves, then, in general, the cooperative algorithm needs a significantly smaller number of
epochs to converge. For instance, suppose that there are two slaves using learning rates $\eta_1$ and $\eta_2$.
In order to get a performance better than the sequential algorithm, we choose the learning rates $\eta_1$
and $\eta_2$ so that $\eta_1 < \eta_{min} < \eta_2$. For the pattern classification problem, it is easy to see from the One
Node table in Figure 2 that the fastest convergence for the sequential algorithm takes 1179 epochs
with $\eta = 0.125$. By setting $\eta_1 = 0.05$ and $\eta_2 = 0.175$, the cooperative learning algorithm takes
only 985 epochs for convergence. Without any cooperation, the algorithm takes 1819 and 10000
epochs for convergence for $\eta = 0.05$ and $\eta = 0.175$ respectively. Similarly, in Figure 3, the fastest
convergence for the sequential algorithm takes 1727 epochs for $\eta = 0.3$. With the learning rate
set to 0.25 and 0.35 the non-cooperative algorithm takes 2238 and 30000 epochs respectively. But,
with cooperation, the convergence takes 1777 epochs, which is very close to the fastest sequential
convergence.

In sequential back-propagation, learning rates less than and greater than $\eta_{\min}$ exist if the number
of epochs for convergence is a non-monotonic function of the learning rate, which is true for many
real-life problems. For these problems, the cooperative algorithm will work better, provided appropriate
learning rates are selected. The XOR problem is an example where the number of epochs is a
monotone decreasing function of the learning rate. So, for this problem, cooperative learning does
not give a better performance.

4.2  Time-based  experiments 

Here,  the  time  between  two  cooperations  (one  era  is  400ms  in  our  experiments)  is  fixed.  So,  the 
total  time  for  convergence  of  the  time-based  cooperation  is  proportional  to  the  product  of  the 
number of cooperations and the execution time of one era. From Figure 4, it can be observed
that  the  time-based  cooperation  executed  on  a  heterogeneous  system  with  one  fast  and  one  slower 
machine  converges  much  faster  than  on  a  homogeneous  system  with  two  slower  machines.  Also,  the 
algorithm  is  more  efficient  if  the  slave  with  the  higher  learning  rate  is  assigned  to  the  faster  machine 
(see  Figure  4  b,c).  It  is  clear  that  the  slave  on  the  faster  machine  executes  more  epochs  per  era 
than  the  slave  on  the  slower  machine.  So,  if  the  slave  with  the  smaller  learning  rate  is  assigned  to 
the  faster  machine,  the  weights  computed  by  the  two  slaves  are  not  very  far  apart.  Consequently, 
averaging  is  not  so  beneficial  in  this  case. 
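
The effect of assigning learning rates to machines of different speed can be illustrated with a small sketch. It assumes, hypothetically, that a cooperation step is a plain component-wise average of the slaves' weight vectors and that a faster machine simply completes more epochs per era; the function name `cooperative_descent`, the quadratic toy loss, and all parameter values are illustrative and not taken from the experiments.

```python
def cooperative_descent(grad, w0, learning_rates, epochs_per_era, n_eras):
    """Gradient descent with one 'slave' per learning rate.

    After every era each slave has run its own number of epochs from the
    shared weights; the shared weights are then replaced by the average
    of the slaves' results (the assumed cooperation step).
    """
    w = list(w0)
    for _ in range(n_eras):
        slave_weights = []
        for eta, n_epochs in zip(learning_rates, epochs_per_era):
            ws = list(w)
            for _ in range(n_epochs):
                g = grad(ws)
                ws = [wi - eta * gi for wi, gi in zip(ws, g)]
            slave_weights.append(ws)
        # Cooperation: component-wise average over all slaves.
        w = [sum(col) / len(slave_weights) for col in zip(*slave_weights)]
    return w


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2."""
    return list(w)


# Slow machine (3 epochs per era) gets the small rate,
# fast machine (8 epochs per era) gets the large rate.
w_final = cooperative_descent(quadratic_grad, [1.0, -2.0],
                              learning_rates=[0.05, 0.175],
                              epochs_per_era=[3, 8], n_eras=20)
```

Swapping the two entries of `epochs_per_era` gives the opposite assignment, which makes it easy to compare how far apart the slaves' weights are before each average.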
5  Conclusion


The cooperative learning algorithm proposed here has given promising results. In general, for the
back-propagation algorithm, it is very hard to find learning rates for which the algorithm converges
in the minimum number of epochs. In our algorithm, we can easily select the learning rates such that
the number of epochs for convergence is close to this minimum or even better. This approach can
be used to improve any gradient descent algorithm. It can be easily implemented on a parallel
machine or a network of heterogeneous workstations using p4.

The  experimentation  using  cooperation  with  dynamic  learning  rates  is  still  under  investigation 
with  promising  preliminary  results.  We  are  also  experimenting  with  a  more  sophisticated  way  of 
combining  slave  hypotheses  (instead  of  averaging),  which  might  further  improve  the  performance. 
Abstract

This paper presents a new top-down method for growing neural tree networks using a gradient search for the
best hyperplane that maximizes the information measure called AMIG (average mutual information gain). Previous
information based learning methods are limited to binary classification problems and require square computation
time for calculating the probabilities by using activation values of all training data for each learning epoch. Our
method removes these restrictions to allow learning in multicategory problems and has linear computation time for
calculating the probabilities by using the incremental estimation for each learning epoch. We give an interpretation
of the new learning method as compared to the delta rule or LMS rule. A set of experimental results is presented to
demonstrate the performance of the proposed neural tree network.

1.  Introduction

Neural tree networks are feedforward networks that perform classification in a manner similar to the traditional
decision tree classifiers. There are several methods for the design of neural tree networks/decision tree classifiers
using concepts of information theory ([SeS82], IQuttSj, [BiS89], [CiL92], [Set91], [SeY93a, b]).

Information gain has been used as the object function of optimization to generate neural networks sequentially
([BiS89], [CiL92]) and to prune neural networks [FaE92]. Bichsel and Seitz [BiS89] propose a method to generate
neural networks of hard nonlinearity neurons. They provide a learning algorithm which combines a structured pattern
search and a simulated annealing on the modified object function of information gain. They also suggest that the
gradient method using soft nonlinearity neurons is preferable when all local minima are close in value to the
global minimum. This suggestion has been adopted by Cios and Liu [CiL92] to generate feedforward neural
networks similar to the cascade correlation net of Fahlman and Lebiere [FaL90]. However, these methods ([BiS89],
[CiL92]) are limited to binary classification problems and require square computation time for calculating
the probabilities by using activation values of all training data for each learning epoch. In contrast, the delta rule
that minimizes the LMS (least mean square) error has linear computation time for calculating activation values for
each learning epoch. Our method removes these restrictions to allow learning in multicategory problems with linear
computation time for calculating the probabilities for each learning epoch of training data presentation.

Our usage of the mutual information gain of a partition is based on the framework of the AMIG tree induction
algorithm of Sethi and Sarvarayudu [SeS82]. The AMIG learning algorithm based tree induction method is capable
of generating good multicategory, multifeature split neural trees irrespective of class population unbalance in the
training data.


2.  Learning rule using the information gain as the object function

Our weight adaptation rule is based on the gradient search of the best hyperplane which will maximize the
AMIG information measure of partitioning. The derivation of the weight adaptation rule is fully discussed in
[Yoo93]. Given patterns from C classes and a partitioning P that divides the pattern space into X mutually
exclusive partitions, the AMIG information measure of partitioning, $I(P)$, is written as:

$$I(P) = \sum_{i=1}^{X} \sum_{j=1}^{C} p(r_i, c_j) \log_2 \frac{p(c_j \mid r_i)}{p(c_j)}. \qquad (1)$$

We use the sigmoid function for the gradient search in the weight adaptation rule as

$$g(x) = \frac{1}{1 + \exp\left(-\sum_k w_k x_k\right)}, \qquad (2)$$

where the feature vector and the weight vector are augmented.

We treat the sigmoid function as a continuous or fuzzy count of a frequency for the estimation of the
probability that the given input belongs to the positive region of the hyperplane. In contrast, the sign function is a


*  The  material  presented  here  is  based  upon  the  work  supported  by  the  National  Science  Foundation  under  the  grant 
IRI-9002087. 
binary or crisp count of a frequency for the estimation of the same probability as suggested in [BiS89]. There is a
similar fuzzy count of a frequency in the so-called "fuzzy k-nearest neighbor classifiers" ([J(a83], [KiH92]). Using
this sigmoid function, we can estimate the appropriate probabilities as

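$$p(c_j) = \sum_{x \in c_j} \frac{1}{N_{\mathrm{total}}}, \qquad (3)$$

$$p(r_2, c_j) = \sum_{x \in c_j} \frac{g(x)}{N_{\mathrm{total}}}, \qquad (4)$$

$$p(r_2) = \sum_{x} \frac{g(x)}{N_{\mathrm{total}}}, \qquad (5)$$

$$p(r_1, c_j) = p(c_j) - p(r_2, c_j), \qquad (6)$$

$$p(r_1) = 1 - p(r_2). \qquad (7)$$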
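
The probability estimates of equations (3) to (7) and the information measure of equation (1) can be sketched as follows for a two-way split. This is a minimal illustration, assuming class labels are integers starting at 0 and that the sigmoid argument is the inner product of augmented weight and feature vectors; the function names and the four-point toy data are hypothetical.

```python
import math


def sigmoid(w, x):
    """Soft membership of x in region r2 for the hyperplane w."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))


def soft_probabilities(w, samples, labels, n_classes):
    """Batch estimates following equations (3) to (7)."""
    n_total = len(samples)
    p_c = [0.0] * n_classes
    p_r2_c = [0.0] * n_classes
    for x, j in zip(samples, labels):
        p_c[j] += 1.0 / n_total                 # equation (3)
        p_r2_c[j] += sigmoid(w, x) / n_total    # equation (4)
    p_r2 = sum(p_r2_c)                          # equation (5)
    p_r1_c = [pc - p2 for pc, p2 in zip(p_c, p_r2_c)]  # equation (6)
    p_r1 = 1.0 - p_r2                           # equation (7)
    return p_c, p_r1_c, p_r2_c, p_r1, p_r2


def information_measure(p_c, p_r1_c, p_r2_c, p_r1, p_r2):
    """Mutual information between region and class, in bits."""
    total = 0.0
    for p_r, joint in ((p_r1, p_r1_c), (p_r2, p_r2_c)):
        for pj, pc in zip(joint, p_c):
            if pj > 0.0 and p_r > 0.0 and pc > 0.0:
                # p(c|r) / p(c) equals p(r, c) / (p(r) p(c)).
                total += pj * math.log2(pj / (p_r * pc))
    return total


# Augmented 1-D samples: last component is the constant bias input.
samples = [[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
labels = [0, 0, 1, 1]
info_flat = information_measure(*soft_probabilities([0.0, 0.0], samples, labels, 2))
info_sharp = information_measure(*soft_probabilities([10.0, 0.0], samples, labels, 2))
```

With zero weights every sample is split evenly between the two regions and the measure vanishes; with a steep, correctly placed hyperplane it approaches the one bit of class entropy in this toy set.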

We can maximize the information measure of hyperplane partitioning (X = 2) or equivalently minimize the
conditional class entropy by using the gradient search on the continuous probability defined by the sigmoid
function. The gradient contains the following four components for each weight variable $w_k$:

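$$\begin{aligned}
\frac{\partial I(P)}{\partial w_k} ={}& \sum_{i}\sum_{j} \frac{\partial p(r_i, c_j)}{\partial w_k} \log_2 p(c_j \mid r_i)
 - \sum_{i}\sum_{j} \frac{\partial p(r_i, c_j)}{\partial w_k} \log_2 p(c_j) \\
 &+ \sum_{i}\sum_{j} \frac{p(r_i, c_j)}{\log_e 2 \; p(c_j \mid r_i)} \frac{\partial p(c_j \mid r_i)}{\partial w_k}
 - \sum_{i}\sum_{j} \frac{p(r_i, c_j)}{\log_e 2 \; p(c_j)} \frac{\partial p(c_j)}{\partial w_k}.
\end{aligned} \qquad (8)$$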

The second term amounts to zero since $p(r_1, c_j) + p(r_2, c_j) = p(c_j)$ is an invariant of the weight change. The
third term amounts to zero since both the summation of the joint probabilities $p(r_i, c_j)$ and the summation of the
marginal probabilities $p(r_i)$ equal one. The last term also amounts to zero since $p(c_j)$ is an invariant of the weight
change. We write the weight updating rule as:

$$\Delta w_k = \frac{\mu}{N_{\mathrm{total}}} \, 2\, g(x)(1 - g(x))\, x_k \log_2 \frac{p(c_j \mid r_2)}{p(c_j \mid r_1)}, \qquad (9)$$

where $\mu$ is the learning coefficient. This weight updating rule gives us the extension of Cios and Liu's formulation
[CiL92] with regard to the multicategory problem. Moreover, we can give an interpretation of the weight change
rule when compared to the nonlinear delta rule using the sigmoid function:

$$\Delta w_k = \mu \, 2\, g(x)(1 - g(x))\, x_k\, (T - g(x)). \qquad (10)$$

The desired output, T, is fixed in the delta rule for each class. However, there is no explicit desired output in the
new learning rule. Instead, the ratio of conditional probabilities, $p(c_j \mid r_2)/p(c_j \mid r_1)$, decides the majority region for
each class. That is, if the ratio is greater than one, region $r_2$ will be the majority region for the class. Otherwise, if
the ratio is less than one, region $r_1$ will be the majority region for the class. This ratio gives us a modification to
the usual decision of finding a majority region, which is based on the ratio of joint probabilities, $p(r_2, c_j)/p(r_1, c_j)$
([SeY93a], [SeY93b]).

When the ratio of conditional probabilities equals one, it is necessary to break the symmetry by using the
probability estimation of the law of succession [BoB74]. If the ratio of conditional probabilities still equals one
and the information measure nearly equals zero, we will replace the log ratio factor by a random number in (-1, 1).
If the activation function g(x) is saturated to either 0 or 1 and the information measure nearly equals zero, then the
gradient will equal zero. This state corresponds to the flat region of the energy surface discovered by Bichsel and
Seitz [BiS89]. The object function can be modified to remove the flat region such that:

1(10 +  e 

where $\epsilon$ has a small positive value. If the activation function g(x) is saturated to either 0 or 1 and the information
measure nearly equals zero, then the weight updating rule becomes:

$$\Delta w_k = 2\mu\, (0.5 - g(x))\, x_k\, \frac{\epsilon}{I(P) + \epsilon}. \qquad (12)$$

After taking the limit $\epsilon \to 0$ with $I(P) \ll \epsilon$, we have

$$\Delta w_k = 2\mu\, (0.5 - g(x))\, x_k. \qquad (13)$$

We give an interpretation of this weight updating rule as compared to the delta rule. It is difficult for the delta
rule to escape the flat region of an energy surface since the activation function g(x) is saturated to either 0 or 1 and
the amount of weight change is close to zero; whereas it is not difficult for the AMIG delta rule since the saturated
activation function gives the maximum momentum for escaping the flat region. The sign of the factor (0.5 - g(x))
of equation (13) points in the direction toward the center of the data cluster of class $c_j$ in the projection of the $x_k$ feature.

After adjusting the scaling factor of equation (9) to be comparable with that of the LMS rule of equation (10),
we have:

$$\Delta w_k = \mu \, 2\, g(x)(1 - g(x))\, x_k \, \frac{\log_2 \dfrac{p(c_j \mid r_2)}{p(c_j \mid r_1)}}{\log_2 \dfrac{(N_{\mathrm{total}} - N(c_j) + C)\,(N(c_j) + 1)}{N(c_j) + C}}. \qquad (14)$$

Here, we use the probability estimation of the law of succession to arrive at the maximum ratio
$(N_{\mathrm{total}} - N(c_j) + C)(N(c_j) + 1) / (N(c_j) + C)$ in order to scale the LRCP (log ratio of conditional
probabilities) factor within the interval (-1.0, 1.0). With these two weight updating rules (equations (13) and (14)),
we are able to complete the learning method by using the gradient search of maximum information measure or
minimum conditional class entropy.

The calculation of the probabilities takes $O(m^2)$ time (m is the number of training examples) for each
iteration. This is because of the batch estimation of the probabilities using all training examples. It is
more desirable to compute the probabilities in linear time by using the previous estimation. Applying the
recurrence relation of the estimation of a sample mean m after introducing the (n+1)-th sample $x_{n+1}$ [DuH73] to the
class mean of the sigmoid function

$$\langle g(x) \rangle_{c_j} = \frac{\sum_{x \in c_j} g(x)}{\sum_{x \in c_j} 1} = \frac{p(r_2, c_j)\, N_{\mathrm{total}}}{p(c_j)\, N_{\mathrm{total}}}, \qquad (15)$$

we have

$$p(r_2, c_j)_{\mathrm{new}} = \frac{p(c_j) N_{\mathrm{total}}}{p(c_j) N_{\mathrm{total}} + 1}\, p(r_2, c_j)_{\mathrm{old}} + \frac{1}{p(c_j) N_{\mathrm{total}} + 1}\, p(c_j)\, g(x). \qquad (16)$$
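
The incremental estimate of equation (16) costs a constant amount of work per training example. The following minimal sketch shows one such update; the function name `update_joint_probability` and the numeric values are illustrative assumptions, not part of the original method description.

```python
def update_joint_probability(p_r2_c_old, p_c, n_total, g_new):
    """One incremental update of p(r2, cj) following equation (16).

    p_r2_c_old : previous estimate of p(r2, cj)
    p_c        : class prior p(cj)
    n_total    : total number of training examples
    g_new      : sigmoid output g(x) for the newly presented sample of class cj
    """
    n_class = p_c * n_total              # number of samples in class cj
    keep = n_class / (n_class + 1.0)     # weight given to the old estimate
    return keep * p_r2_c_old + (1.0 - keep) * p_c * g_new


# Class prior 0.5 with 4 training examples, previous estimate 0.25.
unchanged = update_joint_probability(0.25, 0.5, 4, 0.5)  # sample agrees with the estimate
pulled_up = update_joint_probability(0.25, 0.5, 4, 1.0)  # sample lies deep inside r2
```

When the new sample has the same sigmoid output as the current class mean, the estimate does not move; otherwise it moves a fraction 1/(N(c_j) + 1) of the way toward p(c_j) g(x).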


Based on the above discussion, our new learning algorithm is as follows:

Algorithm 1  The gradient search of the maximum information measure using the incremental estimation of
probabilities.

Input :  Training examples $(x_k, l_k)$, where $x_k$ is an augmented vector and $l_k$ is its class label, $l_k \in \{1, 2, \ldots, C\}$.

Output :  A weight vector w that best divides all training examples into two groups.

1.  For each class, calculate $p(c_j)$ and the max log ratio of conditional probabilities,
    $LRCP_{max}(c_j) = \log_2((N_{\mathrm{total}} - N(c_j) + C)(N(c_j) + 1) / (N(c_j) + C))$.

2.  Initialization: Randomly initialize the current weight vector, w. Initialize the iteration counter to zero.
    Calculate $I(P)$, the information measure for the initial weight vector.

3.  Do the following until the weight change (WC) is less than the tolerance or the iteration counter is over the
    limit.

    3.1.  Set iteration_count := iteration_count + 1. Set WC to zero.

    3.2.  Shuffle the training data.

    3.3.  For all training examples, do the following.

        3.3.1.  Calculate $p(r_2, c_j)$ by using the previous estimation and the sigmoid function.

        3.3.2.  Calculate $p(r_1, c_j)$, $p(r_2)$ and $p(r_1)$.

        3.3.3.  Calculate $I(P)$, the information measure.

        3.3.4.  Calculate LRCP (the log ratio of conditional probabilities), $\log_2(p(c_j \mid r_2) / p(c_j \mid r_1))$, by the
                law of succession.

        3.3.5.  If $I(P) = 0.0$ and $g(x_k)(1 - g(x_k)) \approx 0.0$ then

            3.3.5.1.  Set dw := $2\mu\, (0.5 - g(x_k))\, x_k$.

        3.3.6.  Else if $I(P) = 0.0$ and LRCP = 0.0 then

            3.3.6.1.  Set LRCP := random(-1.0, 1.0).

            3.3.6.2.  Set dw := $2\mu\, g(x_k)(1 - g(x_k))\, x_k\, LRCP$.

        3.3.7.  Else

            3.3.7.1.  Set dw := $2\mu\, g(x_k)(1 - g(x_k))\, x_k\, LRCP / LRCP_{max}(c_j)$.

        3.3.8.  Set w := w + dw.

        3.3.9.  Set WC := WC + norm(dw).

End of Algorithm 1.
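
Algorithm 1 can be sketched in a few dozen lines. The version below is only an illustration under several stated assumptions: class labels are integers starting at 0, the sigmoid argument is the inner product of augmented vectors, "nearly zero" is tested against a hypothetical threshold `eps`, and the law of succession is applied to soft counts as (count + 1) / (region count + C). None of these details, nor the names `train_amig_split` and `info_measure`, are fixed by the text above.

```python
import math
import random


def sigmoid(w, x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    s = max(-60.0, min(60.0, s))  # clamp so exp() stays in range
    return 1.0 / (1.0 + math.exp(-s))


def info_measure(p_c, p_r2_c, p_r2):
    """Information measure of a two-way split, in bits."""
    total = 0.0
    for pc, p2 in zip(p_c, p_r2_c):
        for joint, p_r in ((pc - p2, 1.0 - p_r2), (p2, p_r2)):
            if joint > 0.0 and p_r > 0.0:
                total += joint * math.log2(joint / (p_r * pc))
    return total


def train_amig_split(samples, labels, n_classes, mu=0.1, tol=1e-4,
                     max_iter=200, eps=1e-6, seed=0):
    rng = random.Random(seed)
    n_total = len(samples)
    counts = [labels.count(j) for j in range(n_classes)]
    p_c = [n / n_total for n in counts]
    # Step 1: largest attainable log ratio per class.
    lrcp_max = [math.log2((n_total - n + n_classes) * (n + 1) / (n + n_classes))
                for n in counts]
    # Step 2: random initial weights, batch initial estimate of p(r2, cj).
    w = [rng.uniform(-0.5, 0.5) for _ in samples[0]]
    p_r2_c = [0.0] * n_classes
    for x, j in zip(samples, labels):
        p_r2_c[j] += sigmoid(w, x) / n_total
    order = list(range(n_total))
    for _ in range(max_iter):                        # Step 3
        weight_change = 0.0
        rng.shuffle(order)                           # Step 3.2
        for idx in order:                            # Step 3.3
            x, j = samples[idx], labels[idx]
            gx = sigmoid(w, x)
            # 3.3.1: incremental estimate, equation (16).
            p_r2_c[j] = (counts[j] * p_r2_c[j] + p_c[j] * gx) / (counts[j] + 1)
            p_r2 = sum(p_r2_c)                       # 3.3.2
            info = info_measure(p_c, p_r2_c, p_r2)   # 3.3.3
            # 3.3.4: LRCP with law-of-succession smoothing (assumed form).
            c2 = (n_total * p_r2_c[j] + 1) / (n_total * p_r2 + n_classes)
            c1 = (n_total * (p_c[j] - p_r2_c[j]) + 1) / (n_total * (1 - p_r2) + n_classes)
            lrcp = math.log2(c2 / c1)
            slope = gx * (1.0 - gx)
            if info < eps and slope < eps:           # 3.3.5: flat region
                factor = 0.5 - gx
            elif info < eps and abs(lrcp) < eps:     # 3.3.6: symmetric tie
                factor = slope * rng.uniform(-1.0, 1.0)
            else:                                    # 3.3.7: scaled rule (14)
                factor = slope * lrcp / lrcp_max[j]
            dw = [2.0 * mu * factor * xi for xi in x]
            w = [wi + di for wi, di in zip(w, dw)]   # 3.3.8
            weight_change += math.sqrt(sum(d * d for d in dw))  # 3.3.9
        if weight_change < tol:
            break
    return w


# Augmented 1-D toy data: two classes on either side of the origin.
toy_samples = [[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
toy_labels = [0, 0, 1, 1]
w_split = train_amig_split(toy_samples, toy_labels, 2)
sides = [sigmoid(w_split, x) > 0.5 for x in toy_samples]
```

On this toy set the learned hyperplane places the two classes in different regions; which class ends up in $r_2$ depends on the random initial weights.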
3.  Using AMIG learning rule for neural tree networks


In this section, we describe a top-down design method for generating multifeature-split decision trees or neural
tree networks for problems having data from multiple classes. The suggested method can be considered as an
extension of the AMIG decision tree algorithm. To specify a top-down decision tree induction method, three aspects
need to be considered: (1) a method for generating and evaluating possible multifeature splits, (2) a criterion for
determining when to stop the tree from growing, and (3) a way for assigning a class label to each terminal partition
or node of the decision tree. The last aspect of a top-down decision tree design method is the simplest to deal with.
Invariably, the majority rule (i.e., the class most heavily represented in a terminal partition) is used to label the
terminal nodes. Our method also follows the same majority rule. We discuss the remaining two aspects below.

Beginning with the root node, the AMIG delta learning algorithm of the previous section is used in our tree
induction method to generate the best multifeature splits successively. Each selected split gives rise to two data
subsets, each of which is again handled in a similar fashion. As tree induction proceeds, the number of applicable
training vectors for the partitioning algorithm starts decreasing. With decreasing sample size, the issue of properly
estimating the information measure becomes important [BuB74]. In our method, we therefore use the following
expression for estimating the information measure of a partition:

$$I(P) = \sum_{i=1}^{2} \sum_{j=1}^{C} \frac{n_{ij}}{N} \log_2 \frac{(n_{ij} + 1)\, N}{(\mathrm{row}(x_i) + C)\, \mathrm{col}(c_j)}, \qquad (17)$$

where $n_{ij}$ is the number of examples in the partition $x_i$ with class $c_j$, $\mathrm{row}(x_i)$ is the number of examples in the partition
$x_i$, $\mathrm{col}(c_j)$ is the number of examples in the class $c_j$, and N is the total number of training examples.

To determine when to stop tree growing, we follow a two step approach that combines controlled tree growth
with pruning. This approach involves dividing the available training patterns into two subsets. One subset of
training patterns is used to develop the tree to the desired extent. The other subset is used to prune the grown tree.
To control the growth of the tree, we follow the procedure used in the AMIG algorithm. According to this
procedure, the minimum amount of information that must be provided by the tree is given as

$$I_{\min} = -\sum_{j=1}^{C} p(c_j) \log_2 p(c_j) + p_e \log_2 p_e + (1 - p_e) \log_2 (1 - p_e) - p_e \log_2 (C - 1), \qquad (18)$$

where $p_e$ is the acceptable error rate. Letting $I_k$ be the amount of information associated with the weight vector of
the k-th internal node, the cumulative information given by a tree having L terminal nodes is given as

$$I(T) = \sum_{k=1}^{L-1} s_k I_k, \qquad (19)$$

where $s_k$ is the fraction of training examples that pass through the k-th internal node. Thus by keeping a check on
the cumulative information measure of the tree, we determine when to stop tree growing. Once the tree is grown to
the extent that it provides information exceeding $I_{\min}$, we prune it using the pruning subset of the training patterns
to size T, $T \geq C$ with C being the number of classes, with the pruning criterion being that the pruned tree should
have the least possible performance difference on the tree growing and pruning subsets of the training data. We first
search for the pruning point that is the first minimum after the number of categories (two way minimum) in the total
tree performance graph. If there is no such pruning point, we find the maximum jump after the number of categories
(one way minimum) of the performance difference on the tree growing and pruning subsets of the training data. We have
found this simple pruning to provide trees of good classification accuracy compared to more expensive pruning
methods ([BFO84], [GRD91]).
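
The growth-control test built on equations (18) and (19) can be sketched as follows. The sketch assumes the usual convention that 0 log 0 is taken as 0; the function names and the example numbers are hypothetical.

```python
import math


def plogp(p):
    """p * log2(p) with the convention 0 * log2(0) = 0."""
    return p * math.log2(p) if p > 0.0 else 0.0


def minimum_information(class_priors, error_rate):
    """Minimum information the tree must provide, equation (18)."""
    n_classes = len(class_priors)
    value = -sum(plogp(p) for p in class_priors)
    value += plogp(error_rate) + plogp(1.0 - error_rate)
    if n_classes > 1:
        value -= error_rate * math.log2(n_classes - 1)
    return value


def cumulative_information(node_fractions, node_information):
    """Cumulative information of a tree, equation (19)."""
    return sum(s * i for s, i in zip(node_fractions, node_information))


# Two equally likely classes and a zero acceptable error rate:
# the tree must deliver the full class entropy of 1 bit.
i_min = minimum_information([0.5, 0.5], 0.0)

# Root node sees all samples, one child node sees half of them.
i_tree = cumulative_information([1.0, 0.5], [0.8, 0.4])
keep_growing = i_tree < i_min
```

With the acceptable error rate set to zero, as in the experiments reported in this paper, the bound reduces to the class entropy.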

Although the AMIG delta learning based algorithm has a smoother energy surface than the LMS rule as shown
in [Yoo93], it is not guaranteed to find a tree having performance equal to or better than that of the single feature
split decision trees. This is due to the fact that the AMIG delta rule can be stuck at a local optimal solution.


4.  Performance  Evaluation 


In this section, we report on two experiments that were conducted to evaluate the performance of the suggested
multiclass multifeature split neural tree procedure using the AMIG delta learning algorithm. In each experiment, the
acceptable error rate was set to zero to control tree growth. Unless stated otherwise, all the available training vectors
were used for tree growing and all available test vectors were used for tree pruning. In all experiments, the
classification performance was measured as

$$\frac{1}{C} \sum_{j=1}^{C} \frac{1}{N_j} \sum_{k=1}^{N_j} b_{jk}, \qquad (20)$$

where $b_{jk}$ is the Boolean-valued classification score of the k-th sample of the j-th class and $N_j$ is the number of
samples of the j-th class. This equally weighted class average of correct classification yields a more meaningful
performance measure compared to the simple correct classification count measure because the performance of a size
one tree is always 1/C, where C is the number of pattern classes. At each stage of tree building, the iteration count
limit for AMIG delta learning was set to 10 * max(n, d), where n is the number of available training examples at a
given tree stage and d is the dimension of training vectors. To compare different trees on the basis of their size and
linearity of the structure, average tree size was calculated for each case using the following relationship

$$T_{avg} = \frac{1}{L} \sum_{j=1}^{L} d_j, \qquad (21)$$

where L is the number of terminal nodes in a tree and $d_j$ is the depth of the j-th terminal node.
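
The two evaluation quantities can be illustrated with a short sketch. It assumes that the performance measure is the mean of the per-class fractions of correctly classified samples and that the average tree size is the mean depth of the terminal nodes; the function names and the example data are hypothetical.

```python
def class_averaged_accuracy(true_labels, predicted_labels, n_classes):
    """Equally weighted class average of correct classification."""
    correct = [0] * n_classes
    totals = [0] * n_classes
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        if t == p:
            correct[t] += 1
    # Classes without samples contribute nothing to the average.
    per_class = [correct[j] / totals[j] for j in range(n_classes) if totals[j] > 0]
    return sum(per_class) / n_classes


def average_tree_size(terminal_depths):
    """Mean depth of the terminal nodes of a tree."""
    return sum(terminal_depths) / len(terminal_depths)


# A size-one tree labels everything with the majority class.
# Class sizes follow the 222 / 34 split of the THINNING data.
truth = [0] * 222 + [1] * 34
majority_guess = [0] * 256
score = class_averaged_accuracy(truth, majority_guess, 2)
plain_accuracy = sum(t == p for t, p in zip(truth, majority_guess)) / len(truth)
```

For the majority-class guess the class-averaged score is 1/C = 0.5, while the plain accuracy is about 0.87, which shows why the class-averaged measure is more informative on unbalanced data.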

To carry out the assessment of the proposed method, three methods were used for each data set. These are: (1)
the single feature split AMIG procedure, (2) the multifeature split decision tree using the AMIG delta learning rule,
and (3) a backpropagation network with one hidden layer.

The first experiment was designed to evaluate the performance of the proposed method in a problem with
highly uneven class populations. This experiment was performed using the THINNING data set, which was generated
by the application of step one of the thinning algorithm due to Zhang and Suen [ZhS84]. The THINNING data set
consists of 256 8-bit feature vectors that represent various possibilities of an 8-neighborhood in a binary image
with the central pixel of the neighborhood being one. The class label for each combination represents the thinning
decision whether the central point of the neighborhood should be marked for deletion or not. Of the 256
combinations, there are 222 combinations for which the central point is marked for deletion; the remaining 34
combinations correspond to maintaining the central point. The entire data set was used as the training data in this
experiment. Table 1 shows the performance of the three methods. The average size column in Tables 1-2
contains two entries. The first entry denotes the number of terminal nodes and the second entry denotes the value of
$T_{avg}$ defined in equation (21). The backpropagation network, BP, has only one entry denoting the number of
nodes in the hidden layer. It is seen from Table 1 that the proposed procedure, ADR, has a comparable
performance to the backpropagation network, BP. While the single feature decision tree, AMIG, provides slightly lower
performance compared to ADR, the suggested procedure, it has a much higher number of terminal nodes and larger
average tree size.

The second experiment was designed to evaluate the performance of the proposed method in multicategory
problems using VOWEL data. The VOWEL data set represents a difficult classification task of speaker independent
recognition of the 11 steady state vowels [Rob89]. It consists of utterances from 15 speakers, eight males and seven
females, each repeating each vowel six times. Each utterance constitutes a pattern in the form of a 10-dimensional
vector whose components are based on log area ratios derived from linear predictive analysis. The entire set is
divided into two subsets of 528 training vectors, corresponding to four male and four female speakers, and 462 test
vectors, corresponding to the remaining speakers. The nearest neighbor recognition rate for this data is ^.28% in the
original data and 49.13% in the normalized data. The results of the second experiment are summarized in Table 2 for
the VOWEL data. It is observed that the AMIG delta learning rule, ADR, has better performance than that of the
single feature decision tree, AMIG. The BP net results are similar to those of Robinson's Ph.D. dissertation
[Rob89].

Summarizing the results, it can be seen that the proposed AMIG delta rule based multiclass, multifeature
decision tree induction method (ADR) gives another comparable way of neural tree network construction.

5.  Conclusion 

A method for growing neural trees has been described in this paper. This method is based on a modified delta
rule called the AMIG delta rule that performs the gradient ascent search on the AMIG object function in a
multiclass environment. The best two-way grouping is automatically decided by the ratio of conditional class
probabilities due to a partitioning hyperplane and updated by the incremental estimation of probabilities. This is
in contrast to the conventional majority region criterion by the ratio of joint class probabilities due to a partitioning
hyperplane. The log ratio of conditional class probabilities due to a partitioning hyperplane (LRCP) replaces the
factor of (Target - Output) in the usual delta rule. The incremental estimation of probabilities makes the AMIG delta
rule comparable to the delta rule in terms of the time complexity. The performance of the proposed method has
been reported using two data sets. In each case, it has been shown that the suggested method has better performance
than that of comparable methods.
Table 1.  Performance results for THINNING data.

        AvgSize     Accuracy
AMIG    45, 6.0     95.38
ADR     17, 3.85    97.17
BP      12          100.00

Table 2.  Performance results for VOWEL data.

        Before pruning                      After pruning
        AvgSize     Training  Testing      AvgSize     Training  Testing
AMIG    25, 4.96    74.81     37.88        23, 4.87    73.29     38.09
ADR     13, 3.85    84.28     50.65        12, 3.67    83.52     50.65
BP      22          82.95     48.26


An Adaptive Structure Neural Network Using an Improved
Back-Propagation Learning Algorithm


K.  Khorasani  and  H.F.  Yin 

Department of Electrical and Computer Engineering
Concordia  University 

1455  De  Maisonneuve  Blvd.  W.,  Montreal,  Quebec,  H3G  1M8,  Canada 


[Abstract]  An improved back-propagation algorithm is proposed in this paper. The initial weights
and learning rates are set differently for individual hidden layer units. In this algorithm the variances of
the hidden layer units are set to be different. An error analysis is given for removing the unnecessary
hidden units from the network. A procedure for dynamically adjusting the structure of the network
is proposed. Numerical examples are given to illustrate the utility of the proposed methods.

1  Introduction

The back-propagation (B-P) algorithm is the most commonly used neural network model [1,2].
Back-propagation allows us to train the weights in a feedforward network of arbitrary structure by following
a steepest-descent gradient path in weight space, where the energy surface is usually defined by the mean
squared error between desired and actual outputs of the network. There have been many examples of
successful use of back-propagation for performing different tasks [3,4,5].

Unfortunately, back-propagation has some problems. Firstly, the energy surface may have many
local minima, so the algorithm cannot always be guaranteed to converge to the optimal solution. The
second problem is that it is difficult to analyze the behavior of hidden units in a multilayered network.
Consequently it is not easy to estimate the exact number of the hidden units required for a given
problem before the network is trained. The third problem is that the back-propagation algorithm is often
slow.

The weights of the network after training depend on several factors. Among them one may mention
the randomly chosen initial weights and the sequence of training examples. The hidden units have
approximately equal variance [6,7]. For some problems, such as image coding and
compression, these factors may reduce the usefulness of the B-P. This is due to the fact that the
bits must be allocated evenly among the weights of the network and noise cannot be eliminated by
removing the units with lowest variance. If the network is designed with too many hidden units then
the additional error introduced is spread evenly throughout the units and cannot be easily detected or
removed by looking at the signal to noise ratio of the individual units. If the network does not have
enough hidden units, then the learning procedure may never converge. There are algorithms in the
literature [8,9] that can add or delete hidden units from the network. However, it is not easy to decide
when and where the structure of the network should be changed. Since the variance of the hidden
units is at the same level, a large error could be introduced by removing any hidden unit from the
network.

In this paper, we give an improvement to the back-propagation algorithm (IB-P). For a three layer
network with one hidden layer, the initial variance of the weights and the learning rates are set
differently for different hidden layer units. The algorithm results in the hidden units having different
variances; therefore, the hidden layer units have different degrees of importance. A procedure for
dynamically adjusting the network architecture is also proposed. Applications of the proposed methods
to solve pattern recognition and function approximation problems are demonstrated.
2  The B-P Algorithm and its Improvements

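The back-propagation learning algorithm is summarized by the following equations. The forward
propagation algorithm is defined as $net_{pj} = \sum_i w_{ji} o_{pi}$, $o_{pj} = f(net_{pj})$. The back-propagation algorithm is
$\Delta_p w_{ji} = \eta\, \delta_{pj}\, o_{pi}$. The error signal is given by

$$\delta_{pj} = \begin{cases} (t_{pj} - o_{pj})\, f'(net_{pj}) & \text{if the neuron is an output unit,} \\ f'(net_{pj}) \sum_k \delta_{pk} w_{kj} & \text{if the neuron is not an output unit.} \end{cases}$$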

The B-P algorithm is improved as follows: First, we set the initial weights randomly with different
variances. For the weights that connect to the hidden layer unit $k$, the variance is set to $V_k = V_0/\alpha^k$
where $0 < \alpha < 1$. The weights can be produced by any random distribution. In the above equations, the
learning rate $\eta$ is constant. We also choose the learning rate to be different for different hidden layer
units. For a three layer network with one hidden layer, the weight adjustment for hidden neuron $k$
is changed into $\Delta_p w_{ji} = \eta_k \delta_{pj} o_{pi}$. The learning rate $\eta_k$ is now given by $\eta_k = \eta_0/\alpha^k$, where $\eta_0$ is a
positive constant and $0 < \alpha < 1$.

In  this  algorithm,  the  weights  that  connect  to  the  first  hidden  units  are  adjusted  most,  i.e.  the  first 
hidden neuron is the most important one. After the utility of the first hidden neuron is exhausted,
the  second  neuron  becomes  more  significant,  etc.  If  for  instance  the  network  needs  m  hidden  units  for 
training,  the  hidden  neurons  after  the  mth  one  are  adjusted  with  a  very  small  variance.  Therefore, 
these  hidden  neurons  can  be  removed  from  the  network  without  affecting  the  performance  significantly. 
The  last  hidden  neuron  is  the  least  important  one. 
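
The per-unit schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the zero-based unit index, the Gaussian draw, and the default values are assumptions. The scaling parameter is left free; the default of 5.0 is the value reported for the XOR and parity experiments in Section 4.

```python
import numpy as np

def ibp_schedule(n_hidden, v0=1.0, eta0=0.5, alpha=5.0):
    """Return per-unit initial variances V_k and learning rates eta_k.

    Both follow the form V_0 / alpha**k and eta_0 / alpha**k, with the
    hidden units indexed k = 0, 1, ... (zero-based indexing is an assumption).
    """
    k = np.arange(n_hidden)
    variances = v0 / alpha**k
    rates = eta0 / alpha**k
    return variances, rates

def init_hidden_weights(n_inputs, variances, rng):
    """Draw the weights into hidden unit k with variance variances[k].

    A zero-mean Gaussian is used here; the text allows any distribution.
    """
    std = np.sqrt(variances)[:, None]          # one standard deviation per hidden unit
    return rng.standard_normal((len(variances), n_inputs)) * std

variances, rates = ibp_schedule(4)
W_hidden = init_hidden_weights(2, variances, np.random.default_rng(0))
```

Row k of `W_hidden` holds the weights into hidden unit k, and `rates[k]` would scale the weight update for that unit.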

The activation function for all the neurons is selected as $f(x) = 2/(1 + e^{-x}) - 1$. Since
$f'(x) = (1 + f(x))(1 - f(x))/2 \leq 1/2$, it is easy to prove that $|f(x)| \leq |x/2|$ and
$|f(x + \Delta x) - f(x)| \leq |\Delta x/2|$. For the $i$th hidden neuron, we have
$|o_{pi}| = |f(net_{pi})| \leq |net_{pi}|/2 = |\sum_j w_{ij} o_{pj}|/2 \leq W_0 I/2$, where $W_0 = \sum_j |w_{ij}|$ and
$I = \max_j |o_{pj}|$. The input from hidden neuron $i$ to output neuron $k$ is $w_{ik} o_{pi}$. If the hidden neuron $i$ is
removed, the change of the output in neuron $k$ is
$|\Delta o_{ik}| = |f(net_{pk}) - f(net_{pk} - w_{ik} o_{pi})| \leq |w_{ik} o_{pi}|/2$,
and $|\Delta o_{ik}| \leq |w_{ik}| W_0 I/4 \leq (|w_{ik}| + W_0)^2 I/16 = S_{ik}^2 I/16$, where $S_{ik} = (W_0 + |w_{ik}|)$ is the sum of the
absolute values of the weights from the input neurons to the hidden neuron and from the hidden neuron to
the output neuron. If $I = 1$, $|\Delta o_{ik}| \leq S_{ik}^2/16$. Therefore, if a hidden neuron is deleted from a trained
network, the change of the output in neuron $k$ depends on $S_{ik}$. If the network has only one output
neuron, $S_{ik}$ is written as $S_i$.
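
The removal bound can be checked numerically. The sketch below is illustrative only; the function names and the example weights are assumptions. It computes $S_{ik}$ and the bound $S_{ik}^2 I/16$ for one hidden neuron and compares the bound with the actual output change when that neuron's contribution is dropped.

```python
import numpy as np

def f(x):
    """Bipolar sigmoid f(x) = 2 / (1 + exp(-x)) - 1, whose slope is at most 1/2."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def removal_bound(w_in, w_out, input_max=1.0):
    """Return (S_ik, S_ik**2 * I / 16) for one hidden neuron.

    w_in: weights from the inputs to the hidden neuron; w_out: weight from the
    hidden neuron to output k; input_max: I, the largest input magnitude.
    """
    w0 = np.sum(np.abs(w_in))
    s = w0 + abs(w_out)
    return s, s**2 * input_max / 16.0

def actual_change(w_in, w_out, inputs, net_out):
    """Output change at neuron k when the hidden neuron's contribution is removed."""
    o_hidden = f(np.dot(w_in, inputs))
    return abs(f(net_out) - f(net_out - w_out * o_hidden))

s, bound = removal_bound([1.0, -2.0], 1.0)
change = actual_change([1.0, -2.0], 1.0, [1.0, -1.0], net_out=0.3)
```

For any inputs with magnitude at most `input_max`, `change` should not exceed `bound`, because the bound follows from the slope limit on $f$ together with the inequality $ab \leq (a + b)^2/4$.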

For pattern classification a training set is correctly classified if the largest output error over the
entire set is less than one. Let $\beta$ designate the maximum error between the output and target output
for the trained network over the entire sample space. The hidden neuron $i$ is deleted from the network
if $S_i < 4\sqrt{1 - \beta}$. If the hidden neuron $i$ is one of the neurons that cannot be removed from the
network, regardless of the learning rate and the initial weights, then we have $S_i \geq 4\sqrt{1 - \beta}$.

3  An  Adaptive  Structure  Neural  Network 

An  adaptive  structure  neural  network  is  achieved  through  adjusting  the  number  of  neurons  in  the 
hidden  layers  dynamically.  By  using  the  IB-P  learning  algorithm,  the  hidden  neurons  have  different 
variances. If a hidden neuron has small $S_{ik}$ relative to the other neurons, it can be removed from the
network dynamically. For pattern classification problems, if $e_m$, defined as the maximum error between
the desired output and the output of the network over the entire training set, is less than 1, then the
result is considered a correct classification. The maximum error introduced by removing neuron $i$
from the network is $e_{ri} \leq S_i^2/16$. After node $i$ is removed, $e_T$, the maximum difference between the
desired output and network output, becomes $e_T \leq e_m + e_{ri} \leq e_m + S_i^2/16$. If $S_i < 4\sqrt{1 - e_m}$, then
$e_T < e_m + (1 - e_m) = 1$. Therefore, if $S_i < 4\sqrt{1 - e_m}$, node $i$ can be removed and the network can
still provide a correct classification.

For a function approximation problem, a positive number $\nu$ which is larger than the expected error
is selected. If the actual error of the network is less than $\nu$ and $S_i < 4\sqrt{(\nu - e_m)/I}$, we can remove
the hidden neuron $i$ from the network. The maximum error introduced by removing the hidden neuron
$i$ from the network is $e_{ri} \leq S_i^2 I/16 < \nu - e_m$. After neuron $i$ is removed, the maximum output error
is $e_T \leq e_m + e_{ri} < e_m + \nu - e_m = \nu$. So even if the hidden neuron $i$ is removed the actual output
error remains less than $\nu$.
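
The two removal tests of this section reduce to simple threshold checks. The following is a minimal sketch with hypothetical function names; it assumes that the current maximum error $e_m$ and the weight sum $S_i$ have already been measured on the trained network.

```python
import math

def can_prune_classification(s_i, e_m):
    """True if S_i < 4 * sqrt(1 - e_m), so the error after removal stays below 1."""
    if e_m >= 1.0:
        return False                      # the network does not classify correctly yet
    return s_i < 4.0 * math.sqrt(1.0 - e_m)

def can_prune_approximation(s_i, e_m, nu, input_max=1.0):
    """True if S_i < 4 * sqrt((nu - e_m) / I), so the error after removal stays below nu."""
    if e_m >= nu:
        return False                      # the error target nu has not been reached yet
    return s_i < 4.0 * math.sqrt((nu - e_m) / input_max)
```

With an assumed $e_m = 0.5$, a unit with $S_i = 1.47$ passes the classification test, because $4\sqrt{0.5} \approx 2.83$, while a unit with $S_i = 14.1$ does not.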


4  Experimental  Results 

4.1  XOR  and  Parity  Problem 

As a benchmark example we solve the XOR problem to test our algorithm. It is well known that a
network with two hidden nodes can correctly solve the problem. To train our network we start with
4 hidden units using the improved B-P algorithm with $\eta = 0.5$ and $\alpha = 5.0$. The following results
are obtained. Figure 1 shows the change of $S_i$. After the network is trained for 5200 steps, we have
$S_0 = 14.1$, $S_1 = 13.1$, $S_2 = 1.47$, and $S_3 = 0.038$. According to (7), if we remove hidden unit 2 we get
$|\Delta o_2| \leq 1.47^2/16 = 0.14$. Similarly if we remove hidden unit 3 we get $|\Delta o_3| \leq 0.00006$. Therefore,
hidden unit 3 contributes very little to the outputs of the network. Consequently the network gives a
correct classification with hidden neuron 2 deleted.

For the parity problem with 3 inputs, if the network has 3 hidden neurons, we found 4 cases out of
300 getting trapped in local minima when using the original B-P algorithm with random initial
weights. All 300 learning cases converge using the improved learning algorithm with $\alpha = 5$. For the
parity problem with 4 inputs and 5 hidden units, after the network has been trained with $\alpha = 5.0$
and $\eta = 0.5$, the sum of the absolute values of the weights is given in the following table for 8 learning
procedures with random initial weights. We can see that there is a large difference between the variance
of the necessary hidden units and the variance of the unnecessary hidden units, while there is no big
difference among the variances of the necessary hidden units.


Run   S_0         S_1         S_2        S_3        S_4
2     9.609260    8.124382    5.240492   0.044274   0.041409
3     9.338788    7.360354    4.720688   0.092741   0.032782
4     9.338788    7.360354    4.720688   0.092741   0.032782
5     9.819942    8.174868    5.250357   0.132380   0.025526
6     12.271564   11.087083   7.780679   6.153959   0.140350
7     8.238710    7.184856    5.073183   0.089974   0.025867
8     10.716545   8.312218    5.125797   0.478817   0.046587

Run 1 is partly illegible in the source; its legible entries are 7.223655, 0.294625 and 0.054301.

Figure 2 shows the change of the weights if we train a network with four hidden neurons and $\alpha = 5$
for a four input parity problem using the dynamic structure method. We can see that hidden neurons 3
and 4 are removed within 100 steps.

4.2  Function  Approximation 

A neural network can also be used to approximate a function. As an example consider the function
$f(x) = \sin(2\pi x)\cos(6\pi x)/3 + 2\pi x/9$. When we use a network with 9 hidden neurons to approximate
this function, after 45000 training steps with the B-P algorithm the output function of the network is
shown in Figure 3(a). The sums of the absolute values of the weights $S_i$ after the network is trained are
$S_0 = 31.47$, $S_1 = 3.56$, $S_2 = 5.15$, $S_3 = 18.11$, $S_4 = 4.98$, $S_5 = 5.84$, $S_6 = 6.00$, $S_7 = 5.30$, $S_8 = 5.21$.
Figures 3(b), (c) and (d) show the outputs of the network with nodes 1, 3 and 5 removed, respectively.
When the IB-P method is used to train the network with 9 hidden neurons, with $\eta = 0.5$ and $\alpha = 2.0$,
the results are shown in Figure 4. The sums of the absolute values of the weights $S_i$ after the network
is trained are $S_0 = 8.11$, $S_1 = 18.79$, $S_2 = 27.54$, $S_3 = 5.79$, $S_4 = 4.07$, $S_5 = 0.083$, $S_6 = 0.093$,
$S_7 = 0.0092$, $S_8 = 0.0035$. We can see that the last 4 hidden units contribute very little
to the output and therefore can be deleted from the network. Figure 4(a) shows the response of the
trained network and the desired output. Figure 4(b) shows the output of the network with the last
four hidden units removed. If we use the adaptive structure algorithm that removes the hidden units
dynamically we get the following results. The network selected has 8 hidden units at the beginning
of training with $\nu$ set to 0.1. Figure 5(a) shows the change of the sum of the absolute values of the weights.
Figure 5(b) shows the network output before the structure is changed at about 44000 steps. Figure
5(c) shows the network output immediately after the last 3 neurons are removed. Figure 5(d) is the
result of the final network with 5 hidden neurons at the end of the training cycle.

5  Conclusion 

The algorithm proposed in this paper addresses some problems of the back-propagation algorithm.
After a network is trained, we clearly see which and how many hidden units are needed to solve the
given problem. According to the analysis given in this paper, we can remove the hidden neurons from
the network both statically and dynamically. The algorithms proposed in the paper are applicable to
problems such as pattern recognition, function approximation, image compression, etc.

Figure 1: The sum of the absolute values of weights

Figure 2: The sum of the absolute values of weights using adaptive structure scheme

Figure 3: The output of the network using B-P training algorithm

Abstract

Neural  network  (NN)  based  modeling  often  requires  trying  multiple  networks  with 
different  architectures  and  training  parameters  in  order  to  achieve  an  acceptable  model 
accuracy. Typically, only one of the trained networks is selected as "best" and the rest
are  discarded. 

We  propose  using  optimal  linear  combinations  (OLCs)  of  the  corresponding  outputs 
of  a  set  of  NNs  as  an  alternative  to  using  a  single  network.  Modeling  accuracy  is 
measured  by  mean  squared  error  (MSE)  with  respect  to  the  distribution  of  random 
inputs.  Optimality  is  defined  by  minimizing  the  MSE,  with  the  resultant  combination 
referred  to  as  MSE-OLC. 

We  formulate  the  MSE-OLC  problem  for  trained  NNs  and  derive  two  closed-form 
expressions for the optimal combination-weights. An example that illustrates significant
improvement in model accuracy as a result of using MSE-OLCs of the trained
networks is included.

Abstract

We derive a simple constructive algorithm for determining hidden layer weights in a single hidden
layer neural network based on an algebraic condition necessary for the existence of the output layer
weights. The algorithm adds nodes iteratively, performing a simple optimization with the addition of
each node, until the algebraic condition is met. Consequently, the difficult problem of specifying the
number of hidden units a priori is eliminated. The optimization of each node is only as computationally
taxing as the simplest forms of the Hebb rule and hence should enable fast training of networks with this
architecture.


The  Problem 

Let the class of single hidden layer feedforward networks mapping all $\mathbf{x}$ in $\Re^{m-1}$ into $\Re$ be defined by

$\psi_n(\mathbf{x}; \theta) = \sum_{j=1}^{n} v_j \sigma(\mathbf{w}_j^T \tilde{\mathbf{x}})$ (1)

where $n$ is the integer number of hidden units, $\mathbf{w}_j$ in $\Re^m$ is the weight vector from the input units to the
$j$th hidden unit, $v_j$ in $\Re$ is the weight from the $j$th hidden unit to the output node, $\theta = (v_1, \ldots, v_n, \mathbf{w}_1, \ldots, \mathbf{w}_n)$,
$\tilde{\mathbf{x}} = (1, \mathbf{x})^T$ is the augmented input vector, and $\sigma: \Re \to \Re$ is the hidden unit transfer function which is
continuous and always increasing (i.e., $\sigma'(\cdot) > 0$). In particular, we are concerned with finding the network
parameters $\theta$ and $n$ such that this network realizes a given input-output map. That is, assume we
are given a set of input-output pairs $T = \{(\mathbf{x}_1, z_1), (\mathbf{x}_2, z_2), \ldots, (\mathbf{x}_N, z_N)\}$, typically called the training set;
then we want to find a $\theta$ and an $n$ such that $\psi_n(\mathbf{x}_k; \theta) = z_k$ for all $k = 1, \ldots, N$.

To do this, one typically constructs a cost function of the form $E(\theta) = \frac{1}{2}\sum_{k=1}^{N} (z_k - \psi_n(\mathbf{x}_k; \theta))^2$, where
$n$ is determined a priori, and attempts to minimize it with respect to $\theta$ using gradient descent. This type
of optimization has been made popular by the back-propagation algorithm (see Rumelhart et al. (1986)).
Unfortunately, there are many problems associated with this technique, a few of which are: local minima
on the surface of $E(\theta)$ can cause the gradient search to get "stuck"; updating the hidden layer weights
necessitates summations over all the output unit weights and is thus computationally taxing; and
determining $n$ a priori is a difficult problem. Hence we present an alternative approach.

To begin, consider the $N \times m$ matrix $X$, whose row vectors are given by the $N$ augmented input training
patterns $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_N$, and the $N \times n$ matrix $Y$ with column vectors

$\mathbf{y}_j = \sigma(X\mathbf{w}_j)$, (2)

where $j = 1, \ldots, n$ and $\sigma: \Re^N \to \Re^N$ is given by $\sigma(\mathbf{a}) = (\sigma(a_1), \sigma(a_2), \ldots, \sigma(a_N))^T$. Notice that the $k$th row
vector of $Y$ represents the states of the hidden units on the $k$th training pattern. Furthermore, note that
each column vector $\mathbf{y}_j$ is in the image of $L[X]$ under $\sigma$, where $L[A]$ is the linear subspace spanned by the
column vectors of $A$. For the sake of brevity, we will thus think of each $\mathbf{y}_j$ as an element drawn from the set
$\Sigma[X] = \sigma(L[X]) \setminus \{\mathbf{0}\}$ rather than specifying it by a particular $\mathbf{w}_j$. We have removed the zero vector from
$\Sigma[X]$ because, as we will see later, we want to exclude the trivial solution.

From equation 1, the output of the network $\psi(\mathbf{x}_k)$ is a linear sum of the states of the hidden units on the
$k$th training pattern weighted with the parameters $v_1, \ldots, v_n$, which is to say, the $N$ dimensional vector
of outputs over the training set is given by $(\psi(\mathbf{x}_1), \psi(\mathbf{x}_2), \ldots, \psi(\mathbf{x}_N))^T = Y\mathbf{v}$, where $\mathbf{v}$ is in $\Re^n$. Hence a
network can realize the training set when there exists a solution $\mathbf{v}$ such that

$Y\mathbf{v} = \mathbf{z}$, (3)

where $\mathbf{z} = (z_1, z_2, \ldots, z_N)^T$. From linear algebra, $\mathbf{v}$ exists if and only if $\mathbf{z}$ is in the subspace spanned by
the column vectors of $Y$. Therefore we wish to choose a solution set $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ from $\Sigma[X]$ which spans
a linear subspace containing $\mathbf{z}$. An efficient method of choosing such a solution is given herein. Similar
methods are given in Barmann and Biegler-König (1992) and Fujita (1992).
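
The realizability condition of equation 3 can be tested numerically once hidden layer weights are fixed. The sketch below is illustrative: the function names and the tolerance are assumptions, and least squares is used only as a convenient way to check whether $\mathbf{z}$ lies in the column space of $Y$.

```python
import numpy as np

def hidden_states(X_aug, W, sigma=np.tanh):
    """Return Y, whose j-th column is sigma(X w_j); W holds the w_j as columns (m x n)."""
    return sigma(X_aug @ W)

def solve_output_weights(Y, z, tol=1e-9):
    """Solve Y v = z in the least-squares sense and report whether it holds exactly.

    z lies in the column space of Y exactly when the residual is zero, up to
    the numerical tolerance tol.
    """
    v, *_ = np.linalg.lstsq(Y, z, rcond=None)
    residual = np.linalg.norm(Y @ v - z)
    return v, bool(residual <= tol * max(1.0, np.linalg.norm(z)))

# Three augmented training patterns (1, x_k) and a single hidden unit.
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
W = np.array([[0.0], [1.0]])
Y = hidden_states(X_aug, W)
v, realizable = solve_output_weights(Y, 2.0 * Y[:, 0])
```

Here the target is a multiple of the single hidden-unit column, so the check succeeds; a generic target in three dimensions would not lie in that one-dimensional span and more hidden units would be needed.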

Projection  Operators 

We will begin by introducing some of the tools necessary for our construction and analysis. In particular,
we are interested in the projection operator¹.

Definition 1 An $n \times n$ real matrix $P$ is a projection operator if and only if $P$ is symmetric ($P^T = P$) and
$P$ is idempotent ($P^2 = P$).

Furthermore, the $n \times n$ projection matrix² $P$ projects onto a subspace $S$ of $\Re^n$. Hence for $\mathbf{a}$ in $\Re^n$, $P\mathbf{a} = \mathbf{a}$
if and only if $\mathbf{a}$ is in $S$.

Definition 2 If the $n \times n$ projection matrix $P$ projects onto the subspace $S$ in $\Re^n$, then the $n \times n$ matrix
$P_c = I_n - P$, where $I_n$ is the $n \times n$ identity matrix, projects onto the orthogonal complement of $S$.

Hence for $\mathbf{a}$ in $\Re^n$, $P_c\mathbf{a} = \mathbf{a}$ if and only if $\mathbf{a}$ is in $S^\perp$.

We now define the map $p_n: \Re^n \to \Re^n \times \Re^n$ such that $p_n(\mathbf{a}) = \mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T$, for all $\mathbf{a}$ in $\Re^n$. Since $p_n(\mathbf{a})$
is both symmetric and idempotent it is a projection matrix. We also note that if some vector $\mathbf{b}$ in $\Re^n$ is a
scalar multiple of $\mathbf{a}$, i.e., $\mathbf{b} = \alpha\mathbf{a}$ for some $\alpha \in \Re$, then $p_n(\mathbf{a})\mathbf{b} = \alpha p_n(\mathbf{a})\mathbf{a} = \alpha\mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\mathbf{a} = \alpha\mathbf{a} = \mathbf{b}$, and
hence $p_n(\mathbf{a})$ projects onto the one dimensional subspace spanned by the vector $\mathbf{a}$. Now suppose we are given
an $n \times n$ matrix $P$ which projects onto the subspace spanned by the vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$ in $\Re^n$. Then,
for any vector $\mathbf{b}$ in $\Re^n$, the augmented projection matrix $P'$ which projects onto the space spanned by the
vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m, \mathbf{b}$ is given by

$P' = P + p_n(P_c\mathbf{b})$. (4)

This result is proven in the appendix of C. Fujita 1992. We now use this relation to define a projection
operator recursively based on some initial "reference vector". That is, given a sequence of vectors
$\{\mathbf{a}_1, \mathbf{a}_2, \ldots\}$ in $\Re^n$ and a reference vector $\mathbf{r}$ in $\Re^n$, let the recursive sequence $\{P^1, P^2, \ldots\}$ of $n \times n$ projection
matrices be defined by the relation $P^{j+1} = P^j + p_n(P_c^j\mathbf{a}_j)$, where $P^1 = p_n(\mathbf{r})$. Using equation 4, one can
easily verify by induction that $P^{j+1}$ projects onto the subspace spanned by the vectors $\mathbf{r}, \mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_j$.

¹A thorough treatment of projection operators is given in S. Roman 1992.

²We will use the terms projection operator and projection matrix interchangeably.
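
The rank-one projection $p_n(\mathbf{a})$ and the augmentation step of equation 4 are easy to verify numerically. The following sketch uses hypothetical function names and assumes that the new vector is not already contained in the current subspace, since otherwise its residual is zero and $p_n$ is undefined.

```python
import numpy as np

def rank_one_projection(a):
    """p_n(a) = a (a^T a)^{-1} a^T, the projection onto the line spanned by a."""
    a = np.asarray(a, dtype=float)
    return np.outer(a, a) / np.dot(a, a)

def augment_projection(P, b, tol=1e-12):
    """Return P' = P + p_n(P_c b), which projects onto span(range(P), b)."""
    b = np.asarray(b, dtype=float)
    residual = (np.eye(P.shape[0]) - P) @ b      # P_c b, the part of b outside the subspace
    if np.linalg.norm(residual) <= tol * max(1.0, np.linalg.norm(b)):
        raise ValueError("b already lies in the subspace; nothing to add")
    return P + rank_one_projection(residual)

P1 = rank_one_projection([1.0, 0.0, 0.0])        # projects onto the first axis
P2 = augment_projection(P1, [1.0, 1.0, 0.0])     # now projects onto the first two axes
```

In this example `P2` is the diagonal matrix with entries (1, 1, 0): it is symmetric and idempotent and it leaves both spanning vectors unchanged.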

A  Construction 

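We now construct the set of linearly independent vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, where $n \leq N$, which span a subspace
that includes $\mathbf{z}$. As above, define the recursive sequence $\{P^1, P^2, \ldots\}$ of $N \times N$ projection matrices by the
relation $P^{j+1} = P^j + p_N(P_c^j\mathbf{y}_j)$, where $\mathbf{z}$ is our reference vector (i.e., $P^1 = p_N(\mathbf{z})$), $\mathbf{y}_j$ is drawn from the
set $\Sigma[X]$ such that it is not in the subspace spanned by the vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{j-1}$, and $\mathbf{z}$ is defined as in
equation 3. We then have that $P^{j+1}$ projects onto the subspace spanned by the vectors $\mathbf{z}, \mathbf{y}_1, \ldots, \mathbf{y}_j$ for all
$j = 1, 2, \ldots$.

Theorem 1 There exists a positive integer $n \leq N$ such that $P^n\mathbf{y}_n = \mathbf{y}_n$, where $P^n$ is given by the recurrence
relation above. Furthermore, $\mathbf{z}$ is in the space spanned by the vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$.

Proof. Since $P^j$ projects onto the space spanned by the vectors $\mathbf{z}, \mathbf{y}_1, \ldots, \mathbf{y}_{j-1}$, if $P^j\mathbf{y}_j \neq \mathbf{y}_j$ then, by
induction on $j$, the vectors $\mathbf{z}, \mathbf{y}_1, \ldots, \mathbf{y}_j$ are linearly independent. Now suppose that for all $j = 1, 2, \ldots, N$,
$P^j\mathbf{y}_j \neq \mathbf{y}_j$. Then the set of $N+1$ vectors $\mathbf{z}, \mathbf{y}_1, \ldots, \mathbf{y}_N$ in $\Re^N$ are linearly independent. Yet the largest number
of linearly independent vectors in $\Re^N$ is $N$ and hence there must be some $n \leq N$ such that $P^n\mathbf{y}_n = \mathbf{y}_n$.
Furthermore, this implies that $\mathbf{y}_n$ is in the subspace spanned by the vectors $\mathbf{z}, \mathbf{y}_1, \ldots, \mathbf{y}_{n-1}$ and thus there
must exist scalars $\alpha_j$, $j = 0, 1, \ldots, n-1$ such that $\mathbf{y}_n = \alpha_0\mathbf{z} + \sum_{j=1}^{n-1} \alpha_j\mathbf{y}_j$. If $\alpha_0 = 0$ then obviously $\mathbf{y}_n$
is in the subspace spanned by the vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n-1}$, which contradicts our assumption. Hence letting
$\alpha_n = -1$, we have

$\mathbf{z} = -\frac{1}{\alpha_0}\sum_{j=1}^{n} \alpha_j\mathbf{y}_j$

and thus $\mathbf{z}$ is in the space spanned by the vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$. □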
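
The construction behind Theorem 1 can be illustrated with a short loop. In the sketch below the candidate hidden units are drawn with random weights, which is a simplification chosen for illustration and not a selection rule taken from the paper; the function names, the weight scale and the tolerance are likewise assumptions. The loop stops as soon as $P^j\mathbf{y}_j = \mathbf{y}_j$ holds numerically, which happens after at most $N$ units.

```python
import numpy as np

def rank_one_projection(a):
    """p_N(a) = a (a^T a)^{-1} a^T for a nonzero vector a."""
    a = np.asarray(a, dtype=float)
    return np.outer(a, a) / np.dot(a, a)

def construct_hidden_units(X_aug, z, rng, scale=3.0, tol=1e-8):
    """Add tanh hidden units until P^n y_n = y_n; candidates are random draws."""
    N, m = X_aug.shape
    P = rank_one_projection(z)                   # P^1 = p_N(z), z is the reference vector
    weights, columns = [], []
    for _ in range(N):                           # Theorem 1: at most N units are needed
        w = scale * rng.standard_normal(m)
        y = np.tanh(X_aug @ w)
        r = (np.eye(N) - P) @ y                  # P_c^j y_j
        weights.append(w)
        columns.append(y)
        if np.linalg.norm(r) <= tol * np.linalg.norm(y):
            break                                # P^j y_j = y_j, so z is in span{y_1..y_j}
        P = P + rank_one_projection(r)           # P^{j+1} = P^j + p_N(P_c^j y_j)
    return np.column_stack(weights), np.column_stack(columns)

X_aug = np.array([[1.0, -1.0], [1.0, -0.3], [1.0, 0.4], [1.0, 1.0]])
z = np.array([0.5, -1.0, 2.0, 0.1])
W, Y = construct_hidden_units(X_aug, z, np.random.default_rng(0))
v, *_ = np.linalg.lstsq(Y, z, rcond=None)        # output weights solving Y v = z
```

With random candidates the loop typically runs until $n = N$; the point of choosing each $\mathbf{y}_j$ by optimization is to reach the stopping condition with fewer units.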


We now need a method of choosing each $\mathbf{y}_j$ so that the following conditions hold:

1. $\mathbf{y}_j$ is not in the subspace spanned by $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{j-1}$.

2. $P^j\mathbf{y}_j = \mathbf{y}_j$ if such a $\mathbf{y}_j$ in $\Sigma[X]$ exists.

Condition 1 ensures that theorem 1 holds and condition 2 ensures a minimal $n$ in the sense that if there
is a vector in $\Sigma[X]$ which satisfies the theorem then it will be chosen over any other vector. Note that if
$P^n\mathbf{y}_n = \mathbf{y}_n$ for some $n$ then $P_c^n\mathbf{y}_n = \mathbf{0}$ and hence condition 2 holds if for each $j = 1, 2, \ldots$, we choose $\mathbf{y}_j$
such that

$\|P_c^j\mathbf{y}_j\| = \min_{\mathbf{y} \in \Sigma[X]} \{\|P_c^j\mathbf{y}\|\}$. (5)

The Subspace Projection Algorithm.

Given a set $(\mathbf{x}_1, z_1), (\mathbf{x}_2, z_2), \ldots, (\mathbf{x}_N, z_N)$ of training patterns, let the $k$th
row vector of the $N \times m$ matrix $X$ be $(1, \mathbf{x}_k)$, $k = 1, \ldots, N$. Furthermore
let $\mathbf{z} = (z_1, z_2, \ldots, z_N)^T$ and $\mathbf{y}_j = \sigma(X\mathbf{w}_j)$ for all $j = 1, 2, \ldots$,
where $\mathbf{w}_j$ is in $\Re^m$ and $\sigma$ is defined as in equation 2. Let the $N \times N$
projection matrices $P^1$ and $P_c^1$ be defined as $p_N(\mathbf{z})$ and $I_N - P^1$ respectively,
where $p_N(\mathbf{a}) = \mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T$ for all $\mathbf{a}$ in $\Re^N$.

For $j = 1, 2, \ldots$;
    Let $\hat{\mathbf{w}}_{ij} = P_c^j\mathbf{x}_i$ for all $i = 1, \ldots, m$.
    While $\frac{1}{2}\|P_c^j\mathbf{y}_j\|^2$ is not minimum;
        For $i = 1, \ldots, m$;
            $\Delta w_{ij} = -\eta\,\mathbf{y}_j^T\hat{\mathbf{w}}_{ij}$.
            $w_{ij} = w_{ij} + \Delta w_{ij}$.
        End For.
    End While.
    Let $P^{j+1} = P^j + p_N(P_c^j\mathbf{y}_j)$.
    Let $P_c^{j+1} = I_N - P^{j+1}$.
End For when $P^j\mathbf{y}_j = \mathbf{y}_j$.

Figure 1: An algorithm for finding the hidden layer weights in a single hidden layer neural network.

This condition can be approximated by a gradient descent algorithm. To this end, consider the cost function

$E(\mathbf{w}_j) = \frac{1}{2}\|P_c^j\mathbf{y}_j\|^2$, (6)

where $\mathbf{w}_j$ is the weight vector of the $j$th hidden unit and $\mathbf{y}_j$ is defined in equation 2. Equation 6 can be
optimized by iteratively changing the components of $\mathbf{w}_j$ in the direction of the negative gradient of
$E(\mathbf{w}_j)$. To do this, we change $w_{ij}$ for all $i = 1, \ldots, m$ by $\Delta w_{ij}$ according to

$\Delta w_{ij} = -\eta\,\mathbf{y}_j^T\hat{\mathbf{w}}_{ij}$, (7)

where $\eta$ in $\Re$ is the step size, $\hat{\mathbf{w}}_{ij} = P_c^j\mathbf{x}_i$, and $\mathbf{x}_i$ is the $i$th column vector of $X$.

It should be noted that equation 7 is, as learning rules go, very simple, i.e., it is no more complicated
than the simplest forms of the Hebb rule. To see this, note that $\hat{\mathbf{w}}_{ij}$ is constant with respect to $\mathbf{w}_j$ and hence
for each $w_{ij}$ the rule involves only a linear sum of the states of the hidden units on each training pattern
weighted with the elements of $\hat{\mathbf{w}}_{ij}$. The only computationally taxing parts of the construction are done when
a new hidden unit is added. The complete algorithm is given in figure 1. Unfortunately, this optimization
does not ensure that condition 1 holds. Yet simulations have shown that if we initialize the weights to large
random values and let $\sigma(\cdot) = \tanh(\cdot)$, then the gradient descent defined in equation 7 will find a $\mathbf{y}_j$ which
"fills out" the subspace defined by $P^j$. Hence $\mathbf{y}_j$ should not be completely contained in any subspace of
span$\{\mathbf{z}, \mathbf{y}_1, \ldots, \mathbf{y}_{j-1}\}$, or equivalently, not in the subspace spanned by the vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{j-1}$.

Abstract

Although several new learning rules have been proposed, the space and time complexities of these two
bottlenecks remain. The bottlenecks occur because all the existing learning techniques are based on a single
large neural structure. In this paper, we approach the solution to this problem by analyzing the input vector
space and then partitioning this space into subspaces. Each subspace will be learned by a small neural network
with a simple learning rule. We also speed up the learning process by reducing the number of training vectors
to $O(mn)$ instead of $2^n$, where $m$ is the number of vectors having output targets equal to one and $n$
is the number of input bits. The reduction is based on the concept of guard ring vectors. The experimental
results show that our learning can be sped up by 2-30 times over the non-partition process. However, the
number of neurons used in our approach is uncontrollable. In some cases the number is reduced, but in some
cases it is increased.


1  Introduction 

In any supervised neural network model such as backpropagation [2], there are two major bottleneck
problems: perfect recognition and convergence rate. The perfect recognition problem concerns two main factors.
One factor is the generalization of the training set. The training set and the actual data set must be very
similar, within a small variance range. If the variance between these sets is high, the chance that an input
vector will be misrecognized is also high. Currently, for a binary space, the size of a training set is in the order
of $2^n$, where $n$ is the size of each input vector. A training set is composed of two subsets. The first subset
consists of all vectors whose target outputs are equal to one, while the second subset consists of all vectors
whose target outputs are equal to zero. The training time complexity may be exponential in some
cases. In this paper we reduce this time complexity to polynomial time by using the guard ring concept.
Another factor concerning perfect recognition is the structure of the network. Generally, a network is
composed of layers of neurons. The recognition capability of a network depends upon this structure and the
number of neurons in each layer [5]. The convergence rate problem concerns the learning rule of the network.
Several learning rules have been proposed. All the classical rules are summarized in [6]. Estimates of the
feasible required number of neurons in the hidden layer are given in [3,5]. If the number of hidden neurons
is too small, the network can only partially recognize the input vectors. On the contrary, if the number of
hidden neurons exceeds the actual requirement, it creates a redundancy problem. These redundant neurons,
when implemented on a chip, will consume power. To overcome these bottleneck problems the whole network
should be partitioned into subnetworks by analyzing the input vector space. For each subnetwork only a
simple learning rule is employed. The contributions of this paper are: 1. the concept of 0-1 surface projection, 2.
an input space slicing algorithm, 3. the guard ring pattern concept, and 4. the maximum subnetwork sharing concept.

The paper consists of eight sections. Section 2 discusses the analysis concept of parallel training. Section
3 discusses the input vector slicing algorithm. Section 4 discusses and provides the bound on the number of
guard ring patterns. The technique to combine all the subnetworks is given in Section 5. Section 6 discusses
the multiple output case. Section 7 gives the experimental results. Section 8 concludes the paper.

*This work is partially supported by Department of Mathematics, Chulalongkorn University, Bangkok, Thailand and by
National Science Foundation, USA under grant NSF-ADP-04


2  Analysis  Concept  of  Parallel  Training 

One solution to the bottleneck problems mentioned previously is to analyze the input vector space
and then slice this space into many subspaces. Each subspace is then learned concurrently by a separate
subnetwork. The space that we study in this paper is a binary space. The definitions of
this space and the distance metrics are as follows.

Definition 1 A binary space of dimension $m$ is a space consisting of a set of vectors. Each vector $V =
(v_1, v_2, \ldots, v_m)$ and each bit $v_i \in \{0, 1\}$.

Definition 2 A bit distance $B(V_i, V_j)$ between binary vectors $V_i$ and $V_j$ is equal to the number of bit pairs
such that the $a$th bit pair satisfies $v_{i,a} \neq v_{j,a}$, over all $a$, where $v_{i,a} \in V_i$ and $v_{j,a} \in V_j$.

Definition 3 Binary vectors $V$ and $W$ are adjacent if $B(V, W) = 1$.

Definition 4 A Euclidean distance $D(V_i, V_j)$ between real vectors $V_i$ and $V_j$ is equal to $\|V_i - V_j\|$.

The size of the binary space is equal to $2^m$ and all possible vectors in the space form a hypercube.
Each $V$ is located at a corner of this cube. The bit distance between any two adjacent vectors is equal to
one and it is also equal to the Euclidean distance.
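
Definitions 2 to 4 translate directly into code. The sketch below is illustrative, with hypothetical function names, and represents binary vectors as tuples of 0s and 1s.

```python
import math

def bit_distance(v, w):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same dimension")
    return sum(1 for a, b in zip(v, w) if a != b)

def are_adjacent(v, w):
    """Two binary vectors are adjacent when their bit distance is exactly 1."""
    return bit_distance(v, w) == 1

v, w = (0, 1, 1), (0, 0, 1)
adjacent = are_adjacent(v, w)
euclidean = math.dist(v, w)        # Euclidean distance between the two cube corners
```

For adjacent corners of the cube both distances equal one; in general the Euclidean distance between two binary vectors is the square root of their bit distance.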

2.1  Separability  in  Binary  Space 

We consider this situation. Given a binary space $S$ whose vectors belong to either class A or class B, we
want to find the conditions that guarantee the separability of $S$ into two classes A and B in $n$ dimensions
by a hyperplane, where $n$ can be any value. The problem of separability has been reported in [1,7,8]. The PAC
learning model is considered in [1], which develops an algorithm to learn $n$ input vectors under given $k$ hidden
threshold units. These $n$ vectors are learned by a single network consisting of $k$ hidden threshold units.
The  time  complexity  of  this  technique  is  0(fcn*  +  +  k^e~^).  A  sufficient  condition  that  a  set  of 

regions can be separated by a 2-layer feedforward network using threshold units is shown in [7]. The
relation among the number of hidden layer nodes, the complexity of a multiclass discrimination problem,
and the number of input vectors needed for good learning is summarized in [8]. These works did not consider the
location distribution of each vector. Unlike these proposed concepts and algorithms, we analyze the grouping
characteristics of the input vectors and slice them into a minimum number of subgroups. Each subgroup must
be perfectly separated from the others. To achieve the minimum number of subgroups, some subgroups must
be combined into one subgroup while preserving the condition that the new group is separable from the others.
Before discussing the technique for slicing a group into subgroups we will consider the conditions under which two
groups in $n$ dimensional space can be separated. These conditions, which are different from those in [9], are
summarized as follows.

Lemma 1 Given a space S consisting of two groups, A and B, whose members are randomly
scattered. If there exists a hyperplane, H, passing through S such that the projection of the vectors in S onto this
hyperplane creates a group vector of either ABA or BAB pattern, then A and B cannot be separated by any
hyperplane.

Definition 5 $P_i^1(V) : B^n \to B^n$ is a surface-1 projection on the $i$-th dimension of a vector $V = (v_1, v_2, \ldots, v_n)$
if $v_i = 1$.

Definition 6 $P_i^0(V) : B^n \to B^n$ is a surface-0 projection on the $i$-th dimension of a vector $V = (v_1, v_2, \ldots, v_n)$
if $v_i = 0$.

From the definitions of the projections above, it is obvious that $P_i^0(V)$ is adjacent to V if the $i$-th bit of V is
1, and $P_i^1(V)$ is adjacent to V if the $i$-th bit of V is 0. Given a binary space S, let $\{V_j\}$ be a set of vectors
in group A and the rest in group S − A. Suppose $P_i^1(V_j)$ and $P_i^0(V_j)$ are applied to every $V_j$ in A. $P_i^1(V_j)$
and $P_i^0(V_j)$ may produce some new vectors not originally in A, which implies that these new vectors must
be in S − A. We conclude the following results.

Theorem 1 In the $i$-th dimension, if both $P_i^1(V_j)$ and $P_i^0(V_j)$, where $V_j \in A$, produce some new vectors not
originally in A, then A is inseparable from S − A by a hyperplane in the $i$-th dimension.

Proof Consider $P_i^1(V_j)$ first. If it generates some new vectors in S − A, there must be some
vectors in S − A adjacent to those vectors in A whose bits are changed from 0 to 1 by the projection. Similarly,
$P_i^0(V_j)$ implies that there must be some vectors in S − A adjacent to those vectors in A whose bits are
changed from 1 to 0 by the projection. Therefore A is sandwiched by some vectors in S − A. By Lemma 1, A
and S − A are inseparable by a hyperplane. □

Theorem 2 In the $i$-th dimension, if only one of $P_i^1(V_j)$ and $P_i^0(V_j)$, where $V_j \in A$, produces some new
vectors not originally in A, then A is separable from S − A by a hyperplane in the $i$-th dimension.

Proof If only one of $P_i^1(V_j)$ and $P_i^0(V_j)$ produces vectors in S − A, then A is not sandwiched
by any vectors in S − A. Thus A and S − A are separable in the $i$-th dimension. □

Theorem 3 A is separable from S − A by a hyperplane if, for every dimension $i$, only one of $P_i^1(V_j)$ and
$P_i^0(V_j)$, where $V_j \in A$, produces some new vectors not originally in A.

Proof The result follows directly from Theorems 1 and 2. If in some dimension $i$ some vectors of A are
sandwiched by some vectors of S − A, then these vectors of A are inseparable from S − A
by a hyperplane. □

Theorem 4 (Generalized exclusive-OR). If every pair $V_i$ and $V_j$ of class A has $B(V_i, V_j) > 2$ then class A
is inseparable from class B.

Proof If every pair $V_i$ and $V_j$ has $B(V_i, V_j) > 2$, all paths that connect $V_i$ and $V_j$ of class A
must pass through some adjacent vectors $V_{k1}$ and $V_{k2}$ in S − A on different paths. When the projections of $V_i$ and $V_j$ on
both surfaces are executed, a sandwich situation is created. Therefore $V_i$ and $V_j$ cannot be in the same
group and cannot be separated from S − A by a hyperplane. □

The following table illustrates how the surface-0 and surface-1 projections are performed. We designate
the symbols s-0 and s-1 to represent surface-0 and surface-1 projections, respectively. The bold vectors are
vectors not in the given set. Let {101, 100, 110, 010, 011} be a given vector set. In the first dimension,
the projection creates a new vector {111} only on surface-1. In the second dimension the projection creates
two new vector pairs, namely {111,000} and {111,001} and in the third dimension the projection creates
two new vector pairs, {001,111} and {111,111}. By applying the above Theorems we can conclude that this
given vector set cannot be separated from the other vectors {111,000,001,001} in this 3-dimensional space
by using a hyperplane.


Given       Dimension 1     Dimension 2     Dimension 3
patterns    s-0     s-1     s-0     s-1     s-0     s-1

101         100     101     101     111*    001*    101
100         100     101     100     110     000*    100
110         110     111*    100     110     010     110
010         010     011     000*    010     010     110
011         010     011     001*    011     011     111*

(* marks a bold vector, i.e., a vector not in the given set.)
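
The projection test of Theorems 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: vectors are written as bit strings, dimension 1 is taken to be the rightmost bit (the numbering that agrees with the worked example above), and the function names `project`, `new_vectors` and `passes_projection_test` are hypothetical.

```python
def project(v, dim, bit):
    """Force dimension `dim` of bit string `v` to `bit` ('0' or '1').

    Assumption: dimension 1 is the rightmost bit of the string.
    """
    pos = len(v) - dim
    return v[:pos] + bit + v[pos + 1:]


def new_vectors(group, dim, bit):
    """Projected vectors that were not originally in the group."""
    return {project(v, dim, bit) for v in group} - set(group)


def passes_projection_test(group):
    """True if no dimension produces new vectors on both surfaces."""
    n = len(group[0])
    for dim in range(1, n + 1):
        if new_vectors(group, dim, "0") and new_vectors(group, dim, "1"):
            return False  # sandwiched in this dimension (Theorem 1)
    return True


given = ["101", "100", "110", "010", "011"]
for dim in (1, 2, 3):
    print(dim, sorted(new_vectors(given, dim, "0")),
          sorted(new_vectors(given, dim, "1")))
print(passes_projection_test(given))
```

For the given set, the first dimension yields new vectors only on surface-1, while the second and third dimensions yield new vectors on both surfaces, so the test reports the set as inseparable.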
3 Input Vector Set Slicing Concept

If class A cannot be separated from class S − A, we need to slice class A further into subclasses such
that each subclass, $A_i$, is separable from S − $A_i$ by a hyperplane. The previous Theorems can be applied
to test the separability of $A_i$ from S − $A_i$. Class A is sliced into many subclasses by conditionally grouping
each vector in A into subclasses. There are two conditions that must be considered during the slicing and
grouping process. The first condition occurs after the 0/1 surface projection. In class A, two vectors $V_i$ and $V_j$
cannot be grouped in the same subclass if there exists a sandwich condition, that is, $P_k^0(V_i) \notin A$ and
$P_k^1(V_j) \notin A$, or $P_k^1(V_i) \notin A$ and $P_k^0(V_j) \notin A$. The second condition concerns the transformation of the given
vector set after the 0/1 surface projection. To group any vectors $V_i$ and $V_j$ together, $V_i$ must be able to
transform to itself or to $V_j$. To prevent the first condition from occurring, each vector must be able to transform
to the other vectors within the same group. For example, let A = {101, 111, 110, 001, 000, 010}. We name
a=101, b=111, c=110, d=001, e=000, f=010. It can be seen that if vectors b, d, and f are in the same group,
vector b cannot be transformed to vectors d and f. Similarly, vector d cannot be transformed to vectors b
and f, and vector f cannot be transformed to vectors b and d. We call the first condition conflicting and
the second condition transformable.

A valid group occurs if the transformable and non-conflicting conditions hold for any pair of vectors
in that group. Let $\{v_1, v_2, \ldots, v_m\}$ be a set of given input vectors. The transformability between any two
vectors $V_i$ and $V_j$ occurs when $B(V_i, V_j) = 1$.

Slicing  Algorithm 

1. For all input vectors, perform surface-0 and surface-1 projections in all dimensions.

2. Select $v_1$. Let g be the group index. Set g = 1.

3. Select all $v_j$'s such that $B(v_1, v_j) = 1$ and there is no conflict between any $v_j$ and $v_1$. Assign $v_j$ to the
same group as $v_1$.

4. Set g = g+1. Select a new $v_i$ not grouped in any group as the first vector of a new group g.

5. Select all $v_k$ not grouped in any group such that $B(v_i, v_k) = 1$ and there is no conflict between any $v_k$
and $v_i$. Assign $v_k$ to the same group as $v_i$.

6. Repeat steps 4 and 5 until all vectors are grouped.

For the above example {101, 111, 110, 001, 000, 010}, the valid groups produced by the slicing algorithm are
{101, 111, 110} and {001, 000, 010}.
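
The grouping steps can be illustrated with a small greedy sketch. It is not the exact algorithm of the paper but one possible reading of it, with two assumptions stated loudly: a vector may join a group when it is at bit distance one from any current member (transformability), and "no conflict" is checked by requiring that the enlarged group still passes the projection test of Theorem 3. All function names are hypothetical.

```python
def hamming(u, v):
    """Bit distance B(u, v) between two bit strings."""
    return sum(a != b for a, b in zip(u, v))


def project(v, i, bit):
    """Force string position i of v to `bit` ('0' or '1')."""
    return v[:i] + bit + v[i + 1:]


def passes_projection_test(group):
    """Assumed conflict check: no dimension yields new vectors on both surfaces."""
    members = set(group)
    for i in range(len(group[0])):
        new0 = {project(v, i, "0") for v in group} - members
        new1 = {project(v, i, "1") for v in group} - members
        if new0 and new1:
            return False
    return True


def slice_vectors(vectors):
    """Greedily slice the vectors into groups (a sketch, not the paper's code)."""
    ungrouped = list(vectors)
    groups = []
    while ungrouped:
        group = [ungrouped.pop(0)]        # first vector of a new group
        grew = True
        while grew:                       # repeat until the group stops growing
            grew = False
            for v in list(ungrouped):
                near = any(hamming(v, u) == 1 for u in group)
                if near and passes_projection_test(group + [v]):
                    group.append(v)
                    ungrouped.remove(v)
                    grew = True
        groups.append(group)
    return groups


print(slice_vectors(["101", "111", "110", "001", "000", "010"]))
```

Under these assumptions the sketch reproduces the two groups of the example, {101, 111, 110} and {001, 000, 010}; the result of a greedy procedure of this kind may depend on the order in which the vectors are listed.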

Theorem 5 The time complexity of the slicing algorithm is $O(m^2)$, where m is the number of given input
vectors.

Proof Consider the worst case. For any vector $V_i$ being considered, we must compare this vector with every
other given vector $V_j$ to check that $V_i$ and $V_j$ do not conflict. The maximum number of $V_j$'s that must be considered
is equal to m. Therefore the maximum number of comparisons is less than or equal to $m^2$, which means that
the time complexity is $O(m^2)$. □

4  Minimal  Number  of  Guard  Ring  Vectors 

It is necessary to use vectors from both class A and class B to train any subnetwork. If we use both classes in full,
the number of training vectors will be equal to $2^n$, where n is the number of input vector bits. The essential
number of additional vectors used to train with the vectors in class A is O(mn), where m is the number of
vectors with respect to nucleus p and n is the number of input bits, as given in [10]. The additional vectors
act as guards protecting all vectors in class A from those in class B. These additional vectors are taken from
class B. The concept of guard ring vectors can be applied to our case. The difference between our case and
[10] is that in our case there are no nucleus vectors for grouping other vectors whose bit distances are equal to one.
In our case all vectors are grouped in the same group if they generate only surface-0 projection vectors or only
surface-1 projection vectors. Our approach is more general than theirs. Let m be the number of vectors in a
separable group, n the number of bits in each vector, d the number of vector pairs having a bit distance of one,
and b the number of vectors having a bit distance of two. The number of guard ring vectors is summarized as
follows.

Theorem 6 The number of guard ring vectors is equal to $mn - 2d - b$.
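
As a hedged numerical check of this count, the sketch below compares the formula with a direct enumeration for one group. Two assumptions are made that the text does not state explicitly: the guard ring is taken to be the set of distinct vectors outside the group at bit distance one from some member, and b is read as the number of vector pairs at bit distance two. The function names are hypothetical.

```python
def hamming(u, v):
    """Bit distance between two bit strings."""
    return sum(a != b for a, b in zip(u, v))


def flip(v, i):
    """Return v with the bit at string position i inverted."""
    return v[:i] + ("1" if v[i] == "0" else "0") + v[i + 1:]


def guard_ring(group):
    """Assumed definition: distinct outside neighbours at bit distance one."""
    members = set(group)
    return {flip(v, i) for v in group for i in range(len(v))} - members


def guard_ring_formula(group):
    """Evaluate mn - 2d - b, with d and b counted over vector pairs."""
    m, n = len(group), len(group[0])
    pairs = [(u, v) for k, u in enumerate(group) for v in group[k + 1:]]
    d = sum(hamming(u, v) == 1 for u, v in pairs)
    b = sum(hamming(u, v) == 2 for u, v in pairs)
    return m * n - 2 * d - b


group = ["101", "111", "110"]
print(sorted(guard_ring(group)), guard_ring_formula(group))
```

For the group {101, 111, 110} both the enumeration and the formula give four guard ring vectors. This agreement on one small example is only a sanity check, not a verification of the theorem.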

5  Combining  Subnetworks 

All subnetworks must be combined to form one network. The combining technique is based on the fact
that the subnetworks generate a single output one at a time. The outputs from the subnetworks must be
combined in such a way that when one of the subnetworks generates an output, that output appears at the
output layer of the combined network. Let us consider the case when the combined network has only
one output at the output layer. Suppose that there are m subnetworks. At any time only one
subnetwork generates an output. Thus the possible output vectors of m bits generated by these subnetworks
are {100...000, 010...000, ..., 000...001}. We need one neuron to learn these outputs and generate a 1
when one of them appears. Training the output neuron to recognize these vectors may take many epochs.
An easier approach to this training is to consider the complement of this situation. Instead of letting
the output neuron generate a 1 when it recognizes these output vectors, we let it generate a 0, and
let it generate a 1 when it recognizes the vector {000...000}. The output vectors {100...000, 010...000, ...,
000...001} then become the guard ring vectors for the vector {000...000}.
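
A minimal sketch of this complemented output neuron follows. The weights and bias are hand-picked for illustration rather than learned (an assumption; in the paper the neuron is trained), and the names `threshold_unit` and `combined_output` are hypothetical.

```python
def threshold_unit(inputs, weights, bias):
    """Hard-threshold neuron: 1 if the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def combined_output(sub_outputs):
    """Complement of a neuron that recognizes only the all-zero vector."""
    m = len(sub_outputs)
    # Hand-picked weights: the unit fires only when every input is 0.
    recognizes_zero = threshold_unit(sub_outputs, [-1.0] * m, 0.5)
    return 1 - recognizes_zero


for vec in ([0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]):
    print(vec, combined_output(vec))
```

The single unit only has to separate the all-zero vector from the one-hot vectors at bit distance one, which is why those vectors play the role of guard ring vectors here.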

6  Multiple  Outputs 

The slicing technique previously discussed can be extended to the multiple output case. The objective in the multiple
output case is to obtain the minimum number of subnetworks, or the maximum sharing of subnetworks, for
all outputs. For a given output, the way to group the input vectors with respect to this output is not unique.
This fact can be exploited to obtain the maximum sharing of subnetworks. Therefore the way to achieve
the maximum sharing is to generate all possible slicing groups and find the common subgroups.

7  Experimental  Results 

The comparison is performed using the classical backpropagation learning rule proposed in [2]. We
did not use any recent complex learning technique such as the one in [4] because we want
to emphasize the slicing algorithm and the projection testing technique rather than the learning algorithm
itself. We want to stress that another approach to speeding up learning, besides improving
the learning rule, is to slice the input vectors and use a simple learning rule to learn the slices in parallel.
We compare two critical factors, the learning speed and the area saved, between the unsliced and sliced input
vector approaches. The total sum square error lies between 0.001 and 0.007. In the speed comparison
between the sliced and unsliced network cases, we try to use the minimum number of neurons in both cases.
For the sliced network case, by the above Theorems, the minimum number of neurons is always achieved.
On the other hand, for the unsliced network case, the fewer neurons are used, the longer the
training time. The meaning of each symbol in the comparison table is as follows: Ns symbolizes
the total minimum number of neurons; Is symbolizes the size of each input vector; Os symbolizes the size
of each output vector; Es symbolizes the number of epochs; SNs symbolizes the number of subnetworks;
N aS symbolizes the number of neurons in each subnetwork; E-range symbolizes the range of epochs of all
subnetworks; %Sp symbolizes the speedup ratio, which is equal to the ratio of the number of epochs of the unsliced network
to the number of epochs of the sliced network; %N change symbolizes the percentage change in the number of neurons
between the sliced and unsliced approaches. The plus sign means the number of neurons is increased while the
minus sign means the number of neurons is decreased.


8  Conclusion 

We presented another approach to speeding up the learning process. Based on the experiments, the learning
process is sped up by a factor of 2 to 30. Although the partitioning algorithm runs in polynomial time, the area
complexity varies from case to case and is difficult to predict. Generally, it can be deduced from the
experiments that the higher the speed, the more neurons are required. The speed comparison in the
above table may vary depending on several parameters such as the learning rate, the initial weight vectors, and
the steepness of the sigmoidal function.
ABSTRACT Recently, some fast and flexible neural networks have been proposed in an attempt to solve
hard optimization problems efficiently. With the intent of representing more closely the major features of
biological neural systems and mimicking their behavior, a neural network model, called the Random Neural
Network (RNN), has been introduced by Gelenbe and has been used in the solution of optimization and recognition
problems. Lately, a supervised learning procedure, mainly based on the minimization of a quadratic
error function, has been proposed for the recurrent RNN model. In this paper we explore the relationship between the
RNN model applied to optimization and network learning, specifically for the acyclic graph partitioning
problem. This new approach links the two domains known as learning and optimization.

1.  INTRODUCTION 

Many optimization problems become intractable when the number of suboptimal solutions grows
exponentially with the size of the problem. Such problems belong to the class of NP-complete problems, i.e.,
no algorithm is known which provides an exact solution to the problem in a computational time which is
polynomial in the size of the problem input. In the past, researchers have developed heuristic methods that
provide suboptimal solutions in a time that is proportional to a polynomial in the size of the problem. But the
solutions provided by heuristic methods are often unacceptable for problems involving large graphs, which
are unfortunately the most frequent in practical applications.

Recently, some fast and flexible neural networks have been proposed in an attempt to solve hard
optimization problems efficiently. With the intent of representing more closely the major features of biological neural
systems and mimicking their behavior, a neural network model, called the Random Neural Network (RNN), has
been introduced by Gelenbe [2, 6] and has been used in the solution of optimization [7] and recognition problems.
Lately, a supervised learning procedure, mainly based on the minimization of a quadratic error function,
has been proposed for the recurrent RNN model [9]. In this paper we explore the relationship between the RNN model
applied to optimization and network learning, specifically for the acyclic graph partitioning problem. This new
approach links the two domains known as learning and optimization. The general idea is that the network
learns to minimize the cost.

This work is organized as follows. In section 2, we present the graph partitioning problem. Section 3
presents the RNN of Gelenbe, the basic learning algorithm and their application to the optimization problem.
Section 4 compares the recurrent RNN model with other methods. Remarks concerning future work and
conclusions are provided in section 5.

2.  DEFINITION  OF  GRAPH  PARTITIONING  PROBLEM 

The problem consists in dividing a graph into several subgraphs so as to minimize the cost of the connections
between them. The idea is to divide the nodes of the graph into several distinct subsets so as to minimize the
links between the subsets, that is, the number of arcs whose end nodes are in different subsets is minimal. The
graphs are acyclic and directed. We can complicate the problem with a weight for the arcs. In this case, we must
minimize the sum of the weights between the subsets. We can also add a weight to the nodes and redefine
what we want to minimize according to the particular characteristics of the problem under study. An example
of a particular definition of the graph is given in [7], for the problem of parallel program partitioning.

In a very general way, to place the problem in a mathematical formulation, the following definition is
necessary: graphs are sets of nodes joined by arcs. A graph can be defined as follows:

$\Pi = (N, A)$, where $\Pi$ is a directed graph,

N is a set of n nodes with which we can associate a weight function $Q : N \to R$. In
our studies $Q(i) = 1$ for $i = 1, \ldots, n$,

$A = \{a_{ij}\}$ are the node pairs that define the arcs. It is known as the adjacency matrix, and it
defines the arc weights of $\Pi$.

The problem consists in dividing the graph into K different subgraphs $\Pi = \{\Pi_1, \ldots, \Pi_K\}$ according to certain
constraints. The classic constraints are:

- The subgraphs must have a specified size, i.e., the weight sum of their nodes must be less than a
given value: $N_{\Pi_k} <$ given value, where $N_{\Pi_k} = \sum_{i \in \Pi_k} Q(i)$ for $k = 1, \ldots, K$.

- The arcs with extremities in different subgraphs must be minimal, or the weight sum of the arcs which join nodes
in different subgraphs must be minimized: $\sum_{(i,j) \in D} a_{ij}$ must be minimized, where
$D = \{(i,j) : i \in \Pi_k \ \& \ j \in \Pi_l \ \& \ l \neq k\}$.

The cost function associates a real value with every subgraph configuration. It allows the cost of a
subgraph configuration to be calculated according to the constraints defined in the cost function. To study this problem as a
graph partitioning optimization problem, we use the cost function:

$F = \sum_{(i,j) \in D} a_{ij} + b \left( \sum_{k=1}^{K} (N_{\Pi_k} - n/K)^2 \right) / K$   (1)

This cost function was defined in [10] as the minimization of the interconnection and imbalance costs. The
latter consists in minimizing the node variance between the subgraphs.

The balance factor (b) defines the importance of the interconnection cost with respect to the imbalance cost, and
$N_{\Pi_k}$ is the number of tasks in $\Pi_k$, $\forall k = 1, \ldots, K$.

The graph partitioning problem is reduced to finding a subgraph configuration with minimum value of the cost
function: $\Pi = MIN(F)$   (2)
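
To make the cost function concrete, the sketch below evaluates it for a small directed chain. It assumes unit node weights, as in the studies above, so that the mean subgraph size is n/K. The function name `partition_cost` and the encoding of a configuration as a list of partition indices are illustrative choices, not part of the original formulation.

```python
def partition_cost(adj, assignment, K, b):
    """Interconnection cost plus b times the size variance of the K subgraphs.

    adj[i][j] is the weight of arc i -> j (0 if absent); assignment[i] is the
    index (0..K-1) of the subgraph that holds node i. Unit node weights assumed.
    """
    n = len(assignment)
    # Sum of the arc weights whose end nodes lie in different subgraphs.
    cut = sum(adj[i][j]
              for i in range(n) for j in range(n)
              if assignment[i] != assignment[j])
    sizes = [assignment.count(k) for k in range(K)]
    imbalance = sum((s - n / K) ** 2 for s in sizes) / K
    return cut + b * imbalance


# Directed chain 0 -> 1 -> 2 -> 3 with unit arc weights.
adj = [[0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
print(partition_cost(adj, [0, 0, 1, 1], K=2, b=1.0))  # balanced split
print(partition_cost(adj, [0, 0, 0, 1], K=2, b=1.0))  # unbalanced split
```

Both splits cut a single arc, so the difference in cost comes entirely from the imbalance term weighted by b.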

3.  THE  RANDOM  NEURAL  NETWORK 

A.  Random  Network  Model 

The Random Network (RNN) model was introduced by Gelenbe [2, 6] in 1989. Signals in this model
take the form of impulses which mimic what is presently known of inter-neural signals in biophysical neural
networks.

We recall here the principal characteristics of the RNN. The model consists of a network of n neurons in
which positive and negative signals circulate. Each neuron accumulates signals as they arrive, and can fire if its
total signal count at a given instant of time is positive. Firing then occurs at random according to an
exponential distribution of constant rate, and signals are sent out to other neurons or to the outside of the
network. Each neuron i of the network is represented at any time t by its input signal potential $k_i(t)$.

Positive and negative signals have different roles in the network. A negative signal reduces by 1 the potential
of the neuron at which it arrives (inhibition), or has no effect on the signal potential if it is already zero, while
an arriving positive signal adds 1 to the neuron potential. Signals can arrive at a neuron either from the outside
of the network or from other neurons. Each time a neuron fires, a signal leaves it, depleting the total input
potential of the neuron. A signal which leaves neuron i heads for neuron j with probability $p^+(i,j)$ as a positive
signal, or with probability $p^-(i,j)$ as a negative signal, or it departs from the network with probability $d(i)$.

Clearly we shall have: $\sum_{j=1}^{n} [p^+(i,j) + p^-(i,j)] + d(i) = 1$ for every i.

Positive signals arrive at the ith neuron according to a Poisson process of rate $\Lambda(i)$. Negative signals arrive at
the ith neuron according to a Poisson process of rate $\lambda(i)$. The rate at which neuron i fires is $r(i)$.

The main property of this model is the excitation probability of a neuron i, $q(i)$. It is defined as:

$q(i) = \lambda^+(i) / (r(i) + \lambda^-(i))$   (3)

where $\lambda^+(i) = \sum_{j=1}^{n} q(j) r(j) p^+(j,i) + \Lambda(i)$, with $\Lambda(i)$ the arrival rate of external positive signals,

and $\lambda^-(i) = \sum_{j=1}^{n} q(j) r(j) p^-(j,i) + \lambda(i)$, with $\lambda(i)$ the arrival rate of external negative signals.

If a unique non-negative solution exists to equation (3) such that each $q(i) < 1$, then the stationary probability
distribution is $p(k) = \prod_{i=1}^{n} (1 - q(i)) \, q(i)^{k_i}$, where $k(t)$ is the vector of signal potentials at time t.

To guarantee the stability of the RNN, the following is a sufficient condition for the existence and uniqueness
of the solution of equation (3): $\Lambda(i) + \sum_{j=1}^{n} r(j) p^+(j,i) < r(i) + \lambda(i)$.
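
Equation (3) is a fixed-point system in the q(i), and one simple way to approximate its solution is repeated substitution. The sketch below is only an illustration: the iteration scheme, the fixed number of sweeps and the clamping of q(i) to at most 1 are assumptions added here, and the name `solve_rnn` is hypothetical.

```python
def solve_rnn(Lam, lam, r, p_plus, p_minus, iters=200):
    """Approximate the excitation probabilities q(i) of equation (3).

    Lam[i], lam[i]: external positive / negative arrival rates of neuron i.
    r[i]: firing rate. p_plus[j][i], p_minus[j][i]: routing probabilities j -> i.
    """
    n = len(r)
    q = [0.0] * n
    for _ in range(iters):
        new_q = []
        for i in range(n):
            lam_plus = Lam[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(n))
            lam_minus = lam[i] + sum(q[j] * r[j] * p_minus[j][i] for j in range(n))
            # Clamp to 1.0 (an added safeguard, not part of equation (3)).
            new_q.append(min(lam_plus / (r[i] + lam_minus), 1.0))
        q = new_q
    return q


# Two neurons: neuron 0 is driven externally and excites neuron 1.
q = solve_rnn(Lam=[0.5, 0.0], lam=[0.0, 0.0], r=[1.0, 1.0],
              p_plus=[[0.0, 1.0], [0.0, 0.0]],
              p_minus=[[0.0, 0.0], [0.0, 0.0]])
print(q)
```

In this two-neuron case the sufficient stability condition holds for both neurons, and the iteration settles at q(0) = q(1) = 0.5.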

B.  Relation  between  the  RNN  Model  and  the  Network  Learning  to  Optimization  Problems 

i)  Learning  in  the  recurrent  RNN  model: 

In the RNN model, the weight parameters $w^+(i,j)$ and $w^-(i,j)$ are defined as:

$w^-(i,j) = r(i) p^-(i,j)$

$w^+(i,j) = r(i) p^+(i,j)$

and $r(i) = \sum_{j} [w^+(i,j) + w^-(i,j)]$

They represent the rates at which positive and negative signals are sent out from any neuron i to neuron j.
Gelenbe has proposed an algorithm in [9] for choosing the set of network parameters W in order to learn a given
set of m input-output pairs $(\Omega, B)$, where the set of successive inputs is denoted:

$\Omega = \{\Omega_1, \ldots, \Omega_m\}$ where $\Omega_m = (\Lambda_m, \lambda_m)$,

$\Lambda_m = (\Lambda_m(1), \ldots, \Lambda_m(n))$, $\lambda_m = (\lambda_m(1), \ldots, \lambda_m(n))$

The successive desired outputs are the vectors $B = \{B_1, \ldots, B_m\}$, where $B_m = (B_m(1), \ldots, B_m(n))$ with $B_m(i) \in
[0,1]$ are the desired output vectors. The network approximates the set of desired output vectors in a
manner which minimizes a cost function $E_m$:

$E_m = \frac{1}{2} \sum_{i=1}^{n} a_i (q(i) - B_m(i))^2$,  $a_i > 0$

The algorithm lets the network learn both n by n weight matrices $W_m^+ = \{w_m^+(i,j)\}$ and $W_m^- = \{w_m^-(i,j)\}$ by
computing for each input $\Omega_m$ a new value of $W_m^+$ and $W_m^-$ using gradient descent. The rule to update the
weights may be written as:

$w_m(u,v) = w_{m-1}(u,v) - \mu \sum_{i=1}^{n} a_i (q_m(i) - B_m(i)) [\partial q(i) / \partial w(u,v)]_m$   (4)

where $\mu > 0$ is some constant, $q_m(i)$ is calculated using $\Omega_m$ and $w(u,v) = w_{m-1}(u,v)$ in (3), and

$[\partial q(i) / \partial w(u,v)]_m$ is evaluated at the values $q(i) = q_m(i)$ and $w(u,v) = w_{m-1}(u,v)$ in (4).

The complete learning algorithm for the network is:

- Initialize the matrices $W_0^+$ and $W_0^-$ in some appropriate manner. Choose a value of $\mu$ in (4).

- For each successive value of m:

  - Set the input-output pair $(\Omega_m, B_m)$

  - Repeat

    - Solve the equation (3) with these values

    - Using (4) and the previous results, update the matrices $W_m^+$ and $W_m^-$

    Until the change in the new values of the weights is smaller than some predetermined value.

ii)  Optimization  using  RNN  model: 

In the RNN model, $q(i)$ depends on $\Lambda(i)$, $\lambda(i)$, $p^+(j,i)$, $p^-(j,i)$, $r(i)$ and the other $q(j)$s. The weight between
neurons is characterized by $p^+(j,i)$, $p^-(j,i)$ and $r(i)$. The update of these parameters is logical in the learning
phase. In optimization, at every iteration we redefine the network without changing the weights. In this
way, $p^+(j,i)$, $p^-(j,i)$ and $r(i)$ are fixed and depend on the nature of the combinatorial problem. Besides, in the
optimization problem the relationship between two neurons is either competitive or cooperative, that is, either $p^+(j,i)$
or $p^-(j,i)$ is null. Of course, if there is no interaction between them, both $p^+(j,i)$ and $p^-(j,i)$ are null. On the
other hand, the emission of external signals is of no interest for optimization; it is better to employ the signals to
inhibit or to excite the neighbor neurons, that is, $d(i)$ is null. The firing rate $r(i)$ is obtained from the reciprocity of
effect between neurons. When two neurons i and j are excited and i emits signals to j, the excitation or
inhibition that i exerts over j must be the same as the excitation or inhibition that i receives.

If the weights are fixed, the only way to lead the network from one stationary state to another is to act
on the inputs. The state of the RNN model is defined by $(q(1), \ldots, q(n))$. The use of two external flows to
every neuron permits a complex scaling of an external positive flow to an external negative flow [8]. In
optimization, the use of two flows is of no interest. We consider $\lambda(i)$ as null so that the neurons only receive
external positive signals, representing the preference that the neuron belongs to the solution. In this way, $q(i)$
and $\Lambda(i)$ become the variables of the RNN model.

We define a dynamic of the external positive signals in the RNN model in order to find the state that gives the
minimal energy in the network. Using the technique of gradient descent, the dynamic of the external excitation
signal is defined as: $\Lambda(u)^{m+1} = \Lambda(u)^{m} - \mu [\partial E / \partial \Lambda(u)]^{m}$ in the m-th iteration   (5)

where E is the energy function of the system state.

This equation describes the control that it is necessary to apply to the system to minimize the energy function.
The optimization with the RNN model is the same as a problem of optimal control. This method uses a technique of
learning, where the network learns to minimize the energy function.

The general algorithm for the RNN model in the optimization problem with learning is:
- Initialize $\Lambda(i)$ in some appropriate manner
- Repeat

  - Solve the equation (3)

  - Using (5) and the previous results, update $\Lambda(i)$.

    If $\Lambda(i)$ is outside of $[0, r(i)]$, replace it by the nearest bound
  Until the change in the new value of $q(i)$ is smaller than some predetermined value.
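
The update-and-clip step of this loop can be sketched as follows. The gradient of the energy with respect to the excitation rates is passed in as a plain list, since its expression depends on the chosen energy function; the upper bound r(i) on the rates is read from the step above, and the name `update_excitation` is hypothetical.

```python
def update_excitation(Lam, grad, r, mu):
    """One step of (5) followed by projection onto the interval [0, r(i)].

    Lam[i]: current external excitation rate; grad[i]: dE/dLambda(i) at the
    current state (supplied by the caller); r[i]: firing rate; mu: step size.
    """
    updated = []
    for L, g, ri in zip(Lam, grad, r):
        value = L - mu * g                          # gradient descent step (5)
        updated.append(min(max(value, 0.0), ri))    # replace by nearest bound
    return updated


print(update_excitation([0.5, 0.1, 0.9], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0], mu=0.2))
```

In this call the first rate moves freely, the second is clipped at the lower bound 0 and the third at the upper bound r(i) = 1.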

iii) Application of the RNN model in partitioning problem:

We use the two solutions presented in [7] for this problem. In this work we study the capability of the
RNN model as a dynamic method using gradient descent. The two solutions in [7] are defined as follows:

In the first solution (RNN1), we study the space of possible solutions. To model this, we use nK+K neurons
of two types. There are nK neurons of type $N_1(i,k)$ that represent one element of the solution space, where i is
the node $T_i$ and k the partition $\Pi_k$, and K neurons of type $N_2(k)$, which represent the load regulator of
partition $\Pi_k$.

a) For $N_1$, if $q_1(i,k) \approx 1$ this solution is admitted. There are negative links between: $N_1(i,k)$ and $N_1(i,z)$ where
$k \neq z$ (incompatible solutions, with probability $p_1^-((i,k),(i,z))$), $N_1(i,k)$ and $N_1(j,z)$ where $k \neq z$ and $a_{ij} = 1$
(successor nodes which are in different partitions, with probability $p_1^-((i,k),(j,z))$), and $N_1(i,k)$ and $N_2(k)$ (the
node i belongs to partition k with regulator k, with probability $p_1^-((i,k),k)$). There are excitation links between
$N_1(i,k)$ and $N_1(j,k)$ where $a_{ij} = 1$ (the successor nodes which are in the same partition, with probability
$p_1^+((i,k),(j,k))$).

b) For $N_2(k)$, if $q_2(k) \approx 0$ the partition has reached its maximal capacity. There are only positive links
between: $N_2(k)$ and $N_1(i,k)$ for $i = 1, \ldots, n$ (the regulator k with the nodes belonging to partition k, with
probability $p_2^+(k,(i,k))$), and $N_2(k)$ and $N_2(z)$ where $k \neq z$ (the other load regulators, with probability $p_2^+(k,z)$).
The equations of the system are:

$q_1(i,k) = X1/Y1$ and $X1 = \sum_{z=k} \sum_{j : a_{ji}=1} q_1(j,z) r_1(j,z) p_1^+((j,z),(i,k)) + q_2(k) r_2(k) p_2^+(k,(i,k))$

$Y1 = r_1(i,k) + \sum_{z \neq k} \sum_{j : a_{ji}=1 \text{ or } j=i} q_1(j,z) r_1(j,z) p_1^-((j,z),(i,k))$   (6)

$q_2(k) = X2/Y2$ and $X2 = \Lambda_2(k) + \sum_{z \neq k} q_2(z) r_2(z) p_2^+(z,k)$

$Y2 = r_2(k) + \sum_{i=1}^{n} q_1(i,k) r_1(i,k) p_1^-((i,k),k)$

And the model parameters:

$d_1(i,k) = d_2(k) = \lambda_1(i,k) = \lambda_2(k) = \Lambda_1(i,k) = 0$, $\Lambda_2(k) = n/K$

$r_1(j,z) p_1^+((j,z),(i,k)) = r_1(j,z) p_1^-((j,z),(i,k)) = r_1(i,k) p_1^-((i,k),k) = 1$
$r_2(k) p_2^+(k,(i,k)) = r_2(z) p_2^+(z,k) = 1$

In the second solution (RNN2), we start with an initial solution that we will try to improve. To model this,
we use n+K neurons of two types: n neurons, $N_1(i,k)$, represent the partition k to which the node i belongs; and
K neurons, $N_2(k)$, represent the load regulator for every partition.

a) For $N_1$, if $q_1(i,k) = 1$ the task i is accepted in partition k. There are positive links between $N_1(i,k)$ and
$N_1(j,k)$ where $a_{ij} = 1$ (the successor nodes, if they are in the same partition, with probability $p_1^+((i,k),(j,k))$). There are
negative links between: $N_1(i,k)$ and $N_1(j,z)$ where $k \neq z$ and $a_{ij} = 1$ (the successor nodes, if they are in different
partitions, with probability $p_1^-((i,k),(j,z))$); and $N_1(i,k)$ and $N_2(k)$ (the node i belongs to partition k with
regulator k, with probability $p_1^-((i,k),k)$).

b) For $N_2(k)$, if $q_2(k) = 0$, the partition has reached its maximal capacity. There are only positive links
between: $N_2(k)$ and $N_1(i,k)$ for i = 1 ... n (the regulator k with the nodes belonging to partition k, with
probability $p_2^+(k,(i,k))$); and $N_2(k)$ and $N_2(z)$ where $k \neq z$ (the other load regulators, with probability $p_2^+(k,z)$).
The equations of the system are:

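\[ q_1(i,k) = X_3/Y_3 \quad \text{and} \quad X_3 = \sum_{z=k} \sum_{j:\, a_{ji}=1} q_1(j,z)\, r_1(j,z)\, p_1^+((j,z),(i,k)) + q_2(k)\, r_2(k)\, p_2^+(k,(i,k)) \]

\[ Y_3 = r_1(i,k) + \sum_{z \neq k} \sum_{j:\, a_{ji}=1} q_1(j,z)\, r_1(j,z)\, p_1^-((j,z),(i,k)) \qquad (7) \]

\[ q_2(k) = X_4/Y_4 \quad \text{and} \quad X_4 = \Lambda_2(k) + \sum_{z \neq k} q_2(z)\, r_2(z)\, p_2^+(z,k) \]

\[ Y_4 = r_2(k) + \sum_{i \in k} q_1(i,k)\, r_1(i,k)\, p_1^-((i,k),k) \]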


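And the model parameters:

\[ d_1(i,k) = d_2(k) = \lambda_1(i,k) = \lambda_2(k) = \Lambda_1(i,k) = 0, \qquad \Lambda_2(k) = n/K \]

\[ r_1(j,z)\, p_1^+((j,z),(i,k)) = r_1(j,z)\, p_1^-((j,z),(i,k)) = r_1(i,k)\, p_1^-((i,k),k) = 1 \]
\[ r_2(k)\, p_2^+(k,(i,k)) = r_2(z)\, p_2^+(z,k) = 1 \]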

iv) Application of the recurrent RNN model to the partitioning problem:

In this article, we use a method in which the neural network learns to optimize, based on an improvement to the RNN
model. To define the dynamics of the external positive signals in the RNN model, in order to find the state that gives the
minimal energy in the model, we introduce $\Lambda_1(i,k)$ as a control parameter in the system equations (6) and (7), and
use the cost function (1) as the energy function (E). The new system equations for RNN1 are

\[ q_1(i,k) = X_1/Y_1 \quad \text{and} \quad X_1 = \Lambda_1(i,k) + \sum_{z=k} \sum_{j:\, a_{ji}=1} q_1(j,z)\, r_1(j,z)\, p_1^+((j,z),(i,k)) + q_2(k)\, r_2(k)\, p_2^+(k,(i,k)) \]

$Y_1$ is identical to $Y_1$ in the system equations (6)    (8)

$q_2(k)$ is identical to $q_2(k)$ in the system equations (6)

and, for RNN2,

\[ q_1(i,k) = X_3/Y_3 \quad \text{and} \quad X_3 = \Lambda_1(i,k) + \sum_{z=k} \sum_{j:\, a_{ji}=1} q_1(j,z)\, r_1(j,z)\, p_1^+((j,z),(i,k)) + q_2(k)\, r_2(k)\, p_2^+(k,(i,k)) \]

$Y_3$ is identical to $Y_3$ in the system equations (7)    (9)

$q_2(k)$ is identical to $q_2(k)$ in the system equations (7)

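To search for the optimal values of $\Lambda_1(i,k)$ and $\Lambda_2(k)$, starting from initial values $\Lambda_1^0(i,k)$ and $\Lambda_2^0(k)$, we use
gradient descent:

\[ \Lambda_1^{m+1}(u,v) = \Lambda_1^{m}(u,v) - \mu \left[ \partial E / \partial \Lambda_1(u,v) \right]^{m} \qquad (10) \]

and

\[ \Lambda_2^{m+1}(v) = \Lambda_2^{m}(v) - \mu \left[ \partial E / \partial \Lambda_2(v) \right]^{m} \qquad (11) \]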

which guarantee that $E^{m+1} \le E^{m}$.
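
The update rule is plain gradient descent on a control parameter. The sketch below illustrates the iteration on a one-dimensional toy energy instead of the partitioning energy. The function name `descend`, the step size `mu=0.1` and the quadratic energy are assumptions chosen only so that the behaviour is easy to check.

```python
def descend(energy, grad, x0, mu=0.1, steps=50):
    """Repeat x <- x - mu * dE/dx and record the energy after each step."""
    x = x0
    history = [energy(x0)]
    for _ in range(steps):
        x = x - mu * grad(x)
        history.append(energy(x))
    return x, history

# Toy energy with its minimum at x = 3 (hypothetical, for illustration only).
energy = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)

x_final, history = descend(energy, grad, x0=0.0)
```

With this step size each update shrinks the distance to the minimum by a constant factor, so the recorded energies never increase. A step that is too large would break that property.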

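The energy function (E) for the graph partitioning problem, according to equation (1), is

\[ E = \sum_{(i,j) \in D} a_{ij}\, q_1(i,k)\, q_1(j,l) + \frac{b}{K} \sum_{k=1}^{K} \left( q_2(k) - \frac{N_{\Pi_k}}{n/K} \right)^2 \qquad (12) \]

where $D = \{ i \in \Pi_k \ \&\ j \in \Pi_l \ \&\ l \neq k \}$.

To determine $\partial E/\partial \Lambda_1(u,v)$ and $\partial E/\partial \Lambda_2(v)$ we use equations (8), (9), (10), (11) and (12):

\[ \partial E/\partial \Lambda_1(u,v) = \left[ \sum \left( a_{ij}\, 1[i>j] + a_{ji}\, 1[i<j] \right) q_1(i,k) \right] \partial q_1(i,k)/\partial \Lambda_1(u,v) \]

\[ \partial E/\partial \Lambda_2(v) = \frac{2b}{K} \left[ q_2(k) - \frac{N_{\Pi_k}}{n/K} \right] \partial q_2(k)/\partial \Lambda_2(v) \]

where

\[ \partial q_1(i,k)/\partial \Lambda_1(u,v) = \frac{1}{Y_{k1}} \left[ 1 - C_1 \right]^{-1} \]

\[ \partial q_2(k)/\partial \Lambda_2(v) = \frac{1}{Y_{k2}} \left[ 1 - C_2 \right]^{-1} \]

and

\[ C_1 = \frac{\sum_{z=k} \sum_{j:\, a_{ji}=1} r_1(j,z)\, p_1^+((j,z),(i,k))}{Y_{k1}} - \frac{X_{k1} \sum_{z \neq k} \sum_{j:\, a_{ji}=1 \text{ or } j=i} r_1(j,z)\, p_1^-((j,z),(i,k))}{Y_{k1}^2} \]

\[ C_2 = q_2(z)\, r_2(z) / Y_{k2} \]

For RNN1, k1 = 1 and k2 = 2; for RNN2, k1 = 3 and k2 = 4.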

4.  PERFORMANCE  EVALUATION 

We compare the random neural model with the approximate methods studied in [10]: genetic algorithm (GA),
simulated annealing (SA) and Kernighan's heuristic (H). For every method we used the parameters that gave the best
performance according to the results of [10]. We used a SUN SPARCstation
IPC with 16M of memory and a matrix as the data structure. The graphs used are defined by the number of nodes
(N) and the average degree of the nodes (D). The execution time is in seconds.

For small graphs (≤15 nodes), simulated annealing and the recurrent RNN model give the optimal solution
(Table 1). In general, the result quality and the execution time are approximately the same. The difference
between the exact solution and the results of the approximate methods is small and the execution times are similar.
For graphs of 50 or more nodes the approximate methods are more interesting, because they find a suboptimal
solution in a reasonable execution time. The recurrent RNN model that starts with an initial solution gives
the best results. In general, the quality of the results of the recurrent RNN model and of the simulated annealing method
is similar, but the execution times of the recurrent RNN model are much shorter.

The recurrent RNN model gives better results than the RNN model that does not use gradient descent, because in the first
model the network learns to optimize. The improvement in the results is substantial. The RNN model that starts
with an initial solution gives better results than the RNN model that uses the whole solution space.


Columns, in order: graph type (N, K, D), then Time and Cost for each of the methods EXAC, H, SA, GA, RNN1, RNN2,
RNN1-rec and RNN2-rec.

0  10  2 

2 

7 

0 

6 

2 

13 

0 

6 

2 

11 

0 

7 

0 

6 

0 

4 

0 

1  10  2 

2 

3 

2.5 

4 

2.5 

8 

3.5 

3 

2.5 

6 

2.5 

6 

4 

5 

2.5 

4 

2.5 

1  15  2 

2 

85 

2.5 

5 

5.5 

7 

3.5 

4 

4.5 

11 

5.5 

6 

6.5 

5 

5 

5 

4 

1  50  5 

5 

10 

15 

148 

9.5 

48 

15 

10 

11 

8 

18 

7 

10.5 

7 

7.5 

0  100  5 

5 

0 

22 

66 

6983 

17 

399 

0 

37 

0 

43 

0 

63 

0 

52 

0 

15 

66.8 

6610 

35.8  356 

58.8 

33 

63.5 

47 

65 

65 

52.8 

54 

32.5 

TABLE  1.  Results  of  the  simulations 
5.  CONCLUSIONS 

The experiments we have run show that the results obtained by each approximate method vary widely
depending on the type and size of the graphs considered. In our study all the methods give good results, but
the recurrent RNN model gives the best results for large graphs, with a short execution time. In the RNN model, the
approaches that start with an initial solution give the better results.

The use of a learning algorithm for optimization problems in RNN networks improves the previous results
obtained with this model. The new results are as good as or better than the results of simulated annealing (the method
that gave the best results in [10]), and the execution time is similar to that of the RNN network without feedback control. The
problem in recurrent RNN networks is the number of variables (p, threshold) that must be controlled.

The execution times of the Genetic Algorithm and Simulated Annealing are very large. For the Genetic
Algorithm, the reason is that the generation calculations take relatively much more time. It is necessary to
determine the best combination of genetic operators in order to decrease the number of generations needed to reach
the suboptimal solution. For Simulated Annealing, since it is not possible to determine coherent movements of
nodes that decrease the energy at every temperature level, the solution is evaluated in a relatively longer time.
The Genetic Algorithm and the Random Neural Model are easy to implement on a parallel machine, and this can
considerably improve the speed obtained with these methods.

Future research will apply these algorithms to the optimization of parallel program speedup; will
examine other combinatorial optimization methods for the solution of design problems in distributed systems
(task assignment, file allocation, ...); will consider a combination of the Random Neural Model and Genetic
Algorithms; and will implement these algorithms on parallel machines.
ABSTRACT

The LMS energy function becomes less useful for the class of bistable ambiguity figures, because the
desired output is ambiguous. A revised back error propagation network based on the polynomial energy function of
Haken's Synergetic Computer, e.g., the $\xi^4$ field, is proposed for the recognition of bistable reversible figures. The training
of the network follows a modified supervised delta learning rule for the interconnection weights. The test of "reversible
figures" is subsequently controlled by the double well potential phase transition tuning parameter, given test image
data. The effects of the tuning parameter changing the potential from a single well into a double well are illustrated. We
demonstrate that networks trained with the new energy function generally perform better in training
speed and classification of patterns than standard back error propagation networks trained with the least mean
square energy.

Keywords:  Reversible  figures,  bistable,  neural  networks,  phase  transition. 

1. INTRODUCTION

Recent advances in top-down design of artificial neural networks have yielded improvement in recognition
of objects belonging to identifiable and therefore labelled classes. The reversible figure problem, vase/face (Figure 1, pattern
1), represents a different challenge, where the data cannot be defined a priori. One picture can belong
equally to two different classes depending on the environmental conditions of perception. We could classify it as either
a vase or two old men facing each other. Since both interpretations are correct, a network is forced either to
converge on a single pattern or to separate the pattern into two unrelated classes. A single-minimum energy function
such as LMS must merge these two interpretations into one class or several classes. We adapt the single order
parameter double well potential function of Haken's Synergetic Computer [1][4] as the artificial neural network
energy function, with Haken's attention parameter as the phase tuning parameter to resolve reversible figures. The
energy function can evolve into a double potential well with appropriate attention parameters. The separate
minima can be used to represent the two feature patterns of the bistable figures. We assume a standard back
error propagation network using three layers of sigmoidal nodes. Connections are made from the bottom layer to
the hidden layer and then from the hidden layer to the output layer, with no local recurrent or intralayer connections.
We test our network using reversible figure input patterns and patterns with small symmetry-breaking perturbations
and noise.

2. GENERIC BPN TRAINING RULE

We begin with a standard feed-forward back error propagation network architecture with the uplink $W_{ij}$
between the output layer neurons (in,out) = (u,v) and the hidden layer neurons (in,out) = (u',v'), and the lower link $W'_{jk}$
between the hidden layer neurons and the input layer neurons (in,out) = (u",v"). From the gradient descent
learning rule, we can derive the following BPN weight update rule by keeping the energy function E unspecified until the
last step of its substitution.

For general E, the weights between the output layer and the hidden layer are given as:

\[ W_{ij}(t+1) = W_{ij}(t) + \Delta W_{ij} \qquad (1) \]

\[ \Delta W_{ij} = \eta\, \delta_i\, v'_j \qquad (2) \]

where

\[ \delta_i = -\frac{\partial v_i}{\partial u_i} \frac{\partial E}{\partial v_i} \qquad (3) \]

For the weights between the hidden layer and the input layer:

\[ W'_{jk}(t+1) = W'_{jk}(t) + \Delta W'_{jk} \qquad (4) \]

\[ \Delta W'_{jk} = \eta\, \delta'_j\, v''_k \qquad (5) \]

where

\[ \delta'_j = \frac{\partial v'_j}{\partial u'_j} \sum_i \delta_i W_{ij} \qquad (6) \]

These are the generic delta training rules based on an unspecified energy function. It can be seen that
$\partial E/\partial v_i$ is the key element of the six equations. It has been shown that the least mean square energy function, minimum
misclassification error functions, and minimax energy functions are all valid for backpropagation type networks.
All these different functions have limitations. The least mean square technique is the most common and is deeply rooted
in our thinking because it is the most general. This technique has earned its place in history for its ability to find the best
fit for a wide variety of data sets. The MME technique helps to separate overlapping features that cause confusion
between two distinct classes of objects and is generally more appropriate for automatic target recognition problems
[8]. The development of minimax gives us another tool for the determination of unknown feature vectors necessary for
class separation.

3.  QUARTIC  POTENTIAL  ENERGY  FUNCTION  TRAINING  RULE 

The general quartic energy function in terms of Haken's order parameter $\xi$ has the following form,

\[ V = A\xi^4 + B\xi^3 + C\xi^2 + D\xi + E \qquad (7) \]

Using the case of Hopf bifurcation ($A = 1/4$, $B = 0$, $C = -\lambda/2$, $D = 0$, $E = 0$), when $\lambda < 0$ the potential energy V has a single
minimum at the origin $\xi = 0$. When $\lambda > 0$ a double well exists with the minima located at $\xi = \pm\sqrt{\lambda}$ and a
minimum value of $-\lambda^2/4$. We are now left with

\[ V = \frac{1}{4}\xi^4 - \frac{\lambda}{2}\xi^2 \qquad (8) \]

This energy function is of the same form as the single order parameter potential function of Haken's Synergetic
Computer. By adapting Haken's potential function as our network's training energy function, we can derive our
specific weight update rules. We will use the attention parameter $\lambda$ as the phase tuning parameter to see its effects
on the network performance.
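
As a quick numerical check of the single-well and double-well cases, the sketch below evaluates the quartic potential and its minima. It assumes that equation (8) has the form V = ξ⁴/4 − λξ²/2, which is how we read the coefficients given for equation (7). The function names `potential` and `minima` are hypothetical.

```python
import math

def potential(xi, lam):
    """Quartic potential V = xi^4 / 4 - lam * xi^2 / 2 (assumed form of Eq. 8)."""
    return 0.25 * xi ** 4 - 0.5 * lam * xi ** 2

def minima(lam):
    """Locations of the minima: one well at 0 for lam <= 0, two wells otherwise."""
    if lam <= 0:
        return [0.0]
    root = math.sqrt(lam)
    return [-root, root]

# Double well for lam = 4: minima at -2 and +2, each of depth -lam^2 / 4 = -4.
wells = minima(4.0)
depth = potential(wells[1], 4.0)
```

Setting dV/dξ = ξ³ − λξ to zero gives ξ = 0 or ξ² = λ, which is why the two wells only appear once λ becomes positive.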


For $\lambda > 0$, we set

where $T_i$ is the target output value of output neuron i, and $v_i$ is the actual output value of that output neuron.
When $T_i = v_i$, $\xi = \pm\sqrt{\lambda}$ are the two attractor positions, representing face and vase respectively. Substituting both
definition (9) and the new energy function (8) into the delta training rule for an arbitrary energy function, formula (3), we
have

where

For $\lambda < 0$, we set

The substitution of these specific expressions into the general weight update rule, Eqs. (3) and (6), yields the corresponding
training rule.

4. SIMULATIONS

To test our energy function we break down the reversible figure of face/vase into a two-dimensional binary
array. Using a 30 × 30 pixel picture gives us sufficient resolution to identify both figures in the pattern. All the
training patterns and testing patterns are listed in Figure 1. We use pattern 1 and pattern 2 as the training patterns for
face and vase respectively. Pattern 3 through pattern 8 are the perturbed test patterns; pattern 9 through pattern 14
are used to test the connectivity performance of the trained networks. The face and vase figures each have 450
pixels, while the perturbation figures, apple and earrings, each have 60 pixels. To train on this pattern, we use 900
input neurons. We vary the number of hidden layer nodes to ensure generality, but limit ourselves to no more than
one percent of the input. Our output layer consists of two nodes, each representing the presence of one attractor.
Weights are allowed to train until convergence and the test patterns are then identified. The performance is then
compared to that of the unmodified LMS networks. We add a momentum term to the training algorithm of both networks
to accelerate training. With the learning coefficient set to 0.15 and the momentum coefficient set to 0.075, the
general form of our learning formula is

\[ W_{ij}(t+1) = W_{ij}(t) + 0.15\, \delta_i v'_j + 0.075\, \Delta W_{ij}(t-1) \qquad (12) \]

Initial weights are random values between ±0.015. The stopping condition for network training is

\[ E_s = \sqrt{\sum_i (T_i - v_i)^2} < 0.01 \]
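
The learning formula and stopping condition can be sketched as follows. The coefficients 0.15 and 0.075 and the threshold 0.01 come from the text, while the function names `momentum_step` and `converged`, the array shapes, and the use of the hidden-layer outputs as the second factor of the weight change are assumptions made for illustration.

```python
import numpy as np

def momentum_step(W, delta, v_hidden, prev_dW, eta=0.15, alpha=0.075):
    """One update W <- W + eta * delta_i * v'_j + alpha * (previous weight change)."""
    dW = eta * np.outer(delta, v_hidden) + alpha * prev_dW
    return W + dW, dW

def converged(target, output, tol=0.01):
    """Stopping test: root of the summed squared output error below tol."""
    diff = np.asarray(target, dtype=float) - np.asarray(output, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2))) < tol

# Two output neurons, three hidden neurons (sizes chosen only for the example).
W = np.zeros((2, 3))
W_new, dW = momentum_step(
    W,
    delta=np.array([1.0, -1.0]),
    v_hidden=np.array([1.0, 0.0, 2.0]),
    prev_dW=np.ones((2, 3)),
)
```

The returned `dW` is fed back as `prev_dW` on the next call, which is what carries the momentum from one iteration to the next.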
Figure 1. Training and testing patterns. Patterns 1 and 2
are the two training patterns for the memory networks.


We test the converging speed of the typical LMS energy neural network while varying the number of hidden units.
The result is shown in Figure 2. When the number of hidden units equals 1, 2, 3 or 5, there is no convergence. When
the number of hidden units equals 4, the network converges after 25 iterations; with 6 hidden units, the number of iterations
drops to 19. But because each iteration takes more time with 6 hidden units than with 4 hidden units, the actual
speed may not improve much. As we will see later, all the networks trained poorly when there were 5 hidden units.
This suggests that the number of hidden units should be kept symmetric with the numbers of input units and output
units. Since we have even numbers of input and output units, the number of hidden units should be even too.

We then test the training speed of the double well potential energy ($\lambda > 0$) neural network (Figure 3).
Compared with Figure 2, we can now have the networks converge with 3, 4, 5 and 6 hidden units by choosing
proper $\lambda$ values. This reveals the better convergence ability of the double well neural networks. The best overall
performance is at 4 hidden units with $\lambda$ equal to 14 and 16, where the networks converged after only 10 iterations,
that is, about 150% faster than the LMS network. Generally, by increasing the $\lambda$ value, we can obtain faster
convergence. But there is a limit on the increase. Beyond a certain value of $\lambda$ the network cannot converge. We believe
this is because of the general gradient descent training method: with bigger $\lambda$, we have a steeper gradient and a bigger
correction of the weights after each iteration, so the network reaches the minimum of the energy landscape faster. But
when the step becomes too big, it leads to oscillation within the potential well, and the system is trapped in a middle
state and never converges to the minimum. This explanation can be further confirmed by observing the evolution
of the $E_s$ value after each iteration during the training. For normally converging training, we observe that the change
of the $E_s$ value, $\Delta E_s$, is quite big initially. It becomes smaller and smaller after each iteration. We can see a clear
trend toward zero until $E_s$ reaches the preset stopping condition. In the case of big $\lambda$, we observed very
big changes of the $E_s$ value initially, which then dropped abruptly to very small values. The trend did not converge toward
zero but toward an intermediate value, i.e. the system was trapped in a spurious intermediate state.


Figure 2. The effect of different hidden layer neuron numbers on the training speed of a typical LMS back
error propagation network. Training pass equaling sixty means non-converged training.

Figure 3. The effect of different phase tuning parameter $\lambda$ values and hidden layer unit numbers on
the converging speed of double potential well energy trained memory networks.

Figure 4. The effect of different phase tuning parameter $\lambda$ values and hidden layer unit numbers on
the converging speed of a single potential well energy.

Figure 5. The comparison of the output standard deviation on different networks: an LMS with four hidden units,
an LMS with six hidden units, a single potential well with four hidden units and $\lambda$ = 10, a double potential well with
four hidden units and $\lambda$ = 10. It shows that these two LMS
networks have much bigger standard deviation errors for all
test patterns.
Figure 4 shows the training speed of the single well potential neural network ($\lambda < 0$). Compared with the
double potential well network, we have pretty much the same data structure, but a bigger range of converging
$\lambda$ values. This is because the double potential well is separated by a middle point at V = 0, such that only one side
of the double well becomes rather shallow. With a bigger $\lambda$ value, the training step could easily jump over the
middle point and get stuck in the opposite well. Since the single potential well has no such tendency, it can accept
much bigger $\lambda$ values. Another interesting aspect of this single well network is that we can now have two hidden
units and still converge.

To test the recognition ability of these three networks, we run all fourteen test patterns
through them; the result is shown in Figure 5. It is clear that the network trained using Haken's
potential energy has much better performance in recognizing the symmetry-breaking perturbed patterns. The
standard deviation over the fourteen test patterns is less than 0.06. Because the first two
patterns are the actual training patterns, the standard deviation values of all the other test patterns should use
their value as the reference. Keeping this in mind, we can see that all twelve perturbed patterns are perfectly
recognized. The recognition ability of the LMS neural networks is quite different. In the case of four hidden units, the
LMS NN has a very big error in recognizing patterns #5, 7, 13 and 14. For six hidden units it improves a
little, but still has a fairly big error in recognizing patterns #12, 13 and 14. We also notice that the poor performance
occurs at different patterns for the cases of 4 hidden units and 6 hidden units. These tests show that large
fluctuations occur at the phase transition critical point, known as critical fluctuations [1].


Figure  6.  The  performance  of  the  networks  under  noise  input  conditions.  The 
single  potential  well  and  the  double  potential  well  trained  networks  both  did 
well  for  up  to  forty  percent  noise  level  input. 


To test the fault tolerance of these quartic potential energy networks, we add noise to the binary
input pattern by randomly reversing input pixel values, and then we look at the standard deviation of the output
error. We tested a single potential well with four hidden units and $\lambda = -11$. Also, we tested a double potential well
with four hidden units and $\lambda = 12$. In both cases the test pattern is the face pattern (pattern 1). Both cases exhibit
good performance in recognizing noisy patterns. For up to 40 percent noise, the output standard deviation is
under 0.07, which is very good for Haken's energy function.
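
The noise procedure described above, reversing randomly chosen pixels of a binary pattern, can be sketched as follows. The text does not say how the pixels were chosen, so flipping an exact fraction of distinct pixels is an assumption, as are the function name `flip_pixels` and the all-zero stand-in for the 30 × 30 face pattern.

```python
import numpy as np

def flip_pixels(pattern, noise_level, rng):
    """Return a copy of a binary pattern with a fraction of its pixels reversed."""
    flat = pattern.ravel().copy()
    n_flip = int(round(noise_level * flat.size))
    # Pick distinct pixel positions so exactly n_flip pixels change.
    idx = rng.choice(flat.size, size=n_flip, replace=False)
    flat[idx] = 1 - flat[idx]
    return flat.reshape(pattern.shape)

rng = np.random.default_rng(0)
face = np.zeros((30, 30), dtype=int)   # placeholder, not the real face pattern
noisy = flip_pixels(face, 0.40, rng)   # 40 percent noise level
```

At the 40 percent level this reverses 360 of the 900 pixels, and the original array is left unchanged.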


6.  CONCLUSION 

Adapting Haken's single order parameter potential function as the training energy of the back error
propagation neural network, we get much improved performance both in training speed and in recognition of
symmetry-breaking and noisy patterns. We can manipulate the tuning parameter to move the double well attractors further
apart or closer together depending on our model of the environment. Moving the tuning parameter toward zero, we
see an increased fluctuation; this is believed to be due to the phase transition phenomenon. Since the LMS energy
function is very close to the phase transition point, we observe an increased fluctuation as expected. This leads us
to believe that although the LMS energy is the most commonly used training energy function, it is less desirable
for recognizing perturbed and noisy patterns. Also, with the LMS training energy, the network becomes
slow and difficult to converge. When $\lambda > 0$, we succeed in training the network to converge to both attractors, with
each representing a pattern. The double well model shows slightly better accuracy in recognizing perturbed patterns
than the single well energy model, though both have much better performance than the LMS energy model. When $\lambda < 0$, we
get a single potential well that is much steeper than the LMS single well. To train a network toward its minimum
energy point, it is desirable to have big jumps initially, with the jump steps becoming smaller and smaller as the
system moves close to the minimum point, until it converges. This cannot be done by adjusting the learning rate,
because it is fixed throughout the training. But by using this $\xi^4$ single potential well, the step size becomes proportional
to the slope naturally, through the $\delta_i$ of Eqs. (3) and (6), where the energy slope varies more than that of LMS. Therefore,
the gradient descent step is very big initially, and then becomes smaller and smaller as it approaches the minimum. The
LMS potential well has a much flatter basin than the single well, so its average gradient descent step is much smaller
than that of the $\xi^4$ single well. Therefore we observe much improved performance in training speed and pattern
recognition. Current interest [11-14] in visual ambiguity figures has led to the present investigation, which uses
the standard neural network approach to test the ability to resolve the ambiguity under symmetry-breaking
perturbation as well as partial and noisy imagery input. We believe that the double well potential energy of Haken's
synergetic computing is a strong performer when coupled with neural networks.
Abstract: In this paper, we present an optimized backpropagation algorithm for
multilayer perceptrons in order to increase learning speed by improving the adaptability to
different problems. To this end, we introduce a new parameter allowing us to tune the slope of the
neuron sigmoid function. Two algorithms result: the first one has a common slope, set
before the network begins to learn; the other one considers all the slopes as variables of the
system and learns them like weights. Comparisons between these algorithms and Fahlman's
quickprop are performed on an encoder/decoder benchmark and then on a base provided
by a texture analysis of sonar images in order to recognize the nature of the sea bed. The slope learning
algorithm seems to be more efficient in applications where there is an appreciable evolution of the
slope value during the training phase.


I. INTRODUCTION

Multilayer perceptrons (MLP) associated with backpropagation algorithms are used in
many applications such as classification or data compression because of their ability to adapt
to the problems through supervised training, and their ability to generalize from partial
information. Although this model is efficient in the running phase, the time consumed by such a
learning algorithm often grows with the complexity of the problem. Thus, our approach has two
goals in optimizing the backpropagation algorithm: increasing learning speed while using an
adaptive method whose settings do not depend on the learning data. We obtain good results by
learning the slope of the sigmoid function of the last layer neurons.

In order to validate our algorithm, whose theory and development are given in [1], we
compare its performance with two other backpropagation algorithms (Rumelhart's basic
one [2] with optimum parameters, and an optimized one called "quickprop" from Fahlman [3])
on two kinds of problem. The first one, a benchmark proposed by Fahlman, belongs to the
encoder/decoder family of problems with a man-made database, while the second one uses a learning
base resulting from texture analysis computations on sea-bed sonar images.


II. SYNTHETIC VISION OF THE ALGORITHMS

Updates applied to a variable x during the training phase may be summarized by the
following general formula

\[ \Delta x(t) = -\eta \frac{\partial E}{\partial x(t)} + \alpha\, \Delta x(t-1) \qquad (1) \]

where

• $E = \frac{1}{2} \sum_{p=1}^{nb\_pattern} \sum_{i=0}^{N_k - 1} (S_i - X_i^{(k)})^2$,

• $\frac{\partial E}{\partial x(t)}$ is the error derivative for x(t),

• and $\eta$, $\alpha$ are the learning rate and momentum.

We also use an extended definition of the neuron sigmoid function f by adding a
parameter $\lambda$ that tunes the slope of f at the origin (for a standard neuron, $\lambda = 1$).

\[ f(t) = \frac{1.0}{1.0 + e^{-\lambda t}} \qquad (2) \]
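
The effect of the slope parameter can be checked numerically. The sketch below assumes that the sigmoid has the form f(t) = 1 / (1 + exp(−λt)). Under that assumption the derivative at the origin is λ/4, so doubling λ doubles the steepness there. The function names `sigmoid` and `slope_at_origin` are hypothetical.

```python
import math

def sigmoid(t, lam=1.0):
    """Sigmoid with tunable slope parameter lam (lam = 1 is the standard neuron)."""
    return 1.0 / (1.0 + math.exp(-lam * t))

def slope_at_origin(lam):
    """Analytic derivative of the sigmoid at t = 0, i.e. lam * f(0) * (1 - f(0))."""
    return lam / 4.0

# Central finite difference around the origin, for comparison with lam / 4.
h = 1e-6
numeric_slope = (sigmoid(h, lam=2.0) - sigmoid(-h, lam=2.0)) / (2.0 * h)
```

Whatever the value of λ, the function still passes through 0.5 at the origin. Only its steepness changes.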


II.1. Rumelhart algorithm [2]:

Only the weights between neurons are considered as variables of the system, and they are
modified according to equation (1) with $\eta$ as learning rate and $\alpha$ as momentum. Each neuron
uses the same slope $\lambda$. With the notations of figure 1, we obtain:

\[ \Delta W_{ij}^{(k,k+1)}(t) = -\eta \frac{\partial E}{\partial W_{ij}^{(k,k+1)}} + \alpha\, \Delta W_{ij}^{(k,k+1)}(t-1) \qquad (3) \]

where 


8E 


if  layer  k  is  the  output  layer 


•^^^  =  Xj^\(l-Xj^^).  ^  otherwise 

p=i 

II.2. Fahlman quickprop algorithm [3]:

Various optimizations are introduced in order to prevent some neurons from getting stuck
in the zero state (for instance, when $X_j^{(k)}$ is close to 0.0 or 1.0, the sigmoid derivative is close to zero: the "flat spot"), but the major
one consists in reducing the cost of the $\Delta w(t)$ computation by using only local second-order
information.

\[ \Delta w(t) = \frac{S(t)}{S(t-1) - S(t)}\, \Delta w(t-1) \qquad (4) \]

where S(t) and S(t-1) are the current and previous values of $\frac{\partial E}{\partial w}$.

As Fahlman says, the new value of $\Delta w$ is only a crude approximation to the optimum
value for the weight, but when applied iteratively, this method is surprisingly effective.
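
The quickprop step fits a parabola through the current and previous error slopes and jumps toward its minimum. Here is a minimal sketch for a single weight. The function name `quickprop_step` and the quadratic toy error are assumptions, and the safeguards of the full algorithm (growth limits, fallback to gradient descent, the flat-spot correction) are deliberately left out.

```python
def quickprop_step(s_now, s_prev, dw_prev):
    """Weight change dw(t) = S(t) / (S(t-1) - S(t)) * dw(t-1).

    No guard is included for S(t-1) == S(t); a real implementation needs one.
    """
    return s_now / (s_prev - s_now) * dw_prev

# Toy error E(w) = (w - 3)^2, whose slope is S(w) = 2 (w - 3).
slope = lambda w: 2.0 * (w - 3.0)

w_prev, w_now = 0.0, 1.0   # the previous step moved the weight from 0 to 1
dw = quickprop_step(slope(w_now), slope(w_prev), w_now - w_prev)
w_next = w_now + dw
```

Because this toy error is exactly quadratic, a single step lands on the minimum at w = 3. On a real error surface the parabola is only a local approximation, which is why the rule is applied iteratively.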

n.3.  Learning  slope  algorithm  [1]  : 

In  addition  to  the  weights,  our  algorithm  introduces  the  slope  of  each  neuron  sigmoid 
function  as  a  variable  of  the  network,  allowing  the  slope  to  be  updat^  in  relation  to  the  network 
evolution,  and  preventing  the  user  from  setting  the  slopes  manually.  Thus,  we  obtain  for  the 
slopes  updates: 


$$\Delta \lambda_i^{(k)}(t) = -\nu \, \frac{\partial E}{\partial \lambda_i^{(k)}} + \kappa \, \Delta \lambda_i^{(k)}(t-1) \qquad (5)$$
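where the derivative $\partial E / \partial \lambda_i^{(k)}$ is obtained from the error terms

$$\delta_i^{(k)} = x_i^{(k)} \, (1 - x_i^{(k)}) \, (s_i - x_i^{(k)}) \quad \text{if layer } k \text{ is the output layer}$$

$$\delta_i^{(k)} = x_i^{(k)} \, (1 - x_i^{(k)}) \sum_{p=1}^{N_{k+1}} \delta_p^{(k+1)} \, w_{ip}^{(k,k+1)} \quad \text{otherwise}$$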
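As a hedged illustration of the slope update, the sketch below treats a single output neuron. It assumes a pre-activation `a`, an output x = f(λa) with f from equation (2), and a squared-error energy E = ½(s − x)², for which the chain rule gives ∂E/∂λ = −(s − x)·x·(1 − x)·a. The function names and numerical values are hypothetical; the momentum form follows equation (5).

```python
import math


def sigmoid(a, lam=1.0):
    """Sigmoid of equation (2) with slope parameter lam."""
    return 1.0 / (1.0 + math.exp(-lam * a))


def slope_gradient(a, lam, target):
    """dE/dlam for E = 0.5 * (target - x)^2 with x = sigmoid(a, lam)."""
    x = sigmoid(a, lam)
    return -(target - x) * x * (1.0 - x) * a


def slope_update(grad, prev_delta, nu, kappa):
    """Equation (5): learning rate nu and momentum kappa applied to the slope."""
    return -nu * grad + kappa * prev_delta


# Finite-difference check of the analytic derivative (illustrative values).
a, lam, s = 0.8, 2.0, 1.0


def energy(l):
    return 0.5 * (s - sigmoid(a, l)) ** 2


h = 1e-6
numeric = (energy(lam + h) - energy(lam - h)) / (2.0 * h)
analytic = slope_gradient(a, lam, s)
delta_lam = slope_update(analytic, 0.0, nu=0.4, kappa=0.75)
```

Here the output is below its target and the pre-activation is positive, so the update increases the slope and pushes the output towards the target.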


III. Experiments and results

Stochastic gradient is employed in the basic and learning slope algorithms (i.e. network variables are updated after each pattern presentation, using an approximation Ep of the global energy E to calculate the errors).


Two experiments are carried out to compare the results of our algorithm with those of the basic and quickprop algorithms. The first one, based on a completely defined encoder/decoder problem, is used by Fahlman in his paper [3] as a benchmark. The second one, resulting from a less artificial approach, applies all these algorithms to patterns provided by a texture analysis of sonar images.

Many computations are needed to compare the performances of these algorithms for a given problem. First, we prepared 10 different weight initializations in order to run each series of training from the same initial states. Then, for each algorithm, we go over its whole coarsely discretized parameter space in order to find the optimum combination that provides the fastest series of training runs.

III.1. Fahlman benchmark: M-N-M encoder (10-5-10 encoder)

A neural network encoder consists of a 3-layer network for which the output results are the same as the input patterns. The encoded information is then recovered by looking at the values of the hidden layer (bottleneck of the network). The artificial training base is made up of a set of M patterns 'u':

$$\{\, i \in [0, M-1],\ u_i \in \mathbb{R}^M \ /\ u_i = (x_0, \ldots, x_{M-1}),\ x_0 = \ldots = x_{i-1} = x_{i+1} = \ldots = x_{M-1} = 0.0,\ x_i = 1.0 \,\}$$
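The training base defined above is simply the set of M one-hot vectors, each used both as input and as wished output. A minimal sketch follows; the helper name `encoder_patterns` is hypothetical.

```python
def encoder_patterns(m):
    """Return the M patterns u_i of the M-N-M encoder: x_i = 1.0, all other x_j = 0.0."""
    return [[1.0 if j == i else 0.0 for j in range(m)] for i in range(m)]


# Training base of the 10-5-10 encoder: inputs and wished outputs are identical.
patterns = encoder_patterns(10)
training_pairs = [(u, u) for u in patterns]
```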

We implemented the quickprop algorithm and what we obtained corroborates Fahlman's results. The hyperbolic arctangent error function was not used in this algorithm, but the standard sum-of-squares one. The end criterion, based on a minimal distance (here 0.3) between network outputs and wished results, is preserved and applied to the other algorithms.


Basic Rumelhart Algorithm

  η      α      Initial λ   Max   Min   Ave    S.D.
  0.6    0.75   2.0         36    17    24.3   7.27

Quickprop Algorithm

  η      α      Range       Max   Min   Ave    S.D.
  1.5    1.75   2.0         72    13    22.1   8.9

Slope Learning Algorithm

  η      α      ν      κ      Initial λ   Max   Min   Ave    S.D.
  1.0    0.75   0.4    0.75   0.75        50    27    37.6   7.29

We observe that a classical MLP with optimum parameters, including a fixed slope, obtains nearly the same results as the quickprop algorithm. The impact of setting the slope is clear, since its optimum value is 2.0 rather than the standard 1.0.

In this case, our algorithm is slower than the other ones because it does not have enough time to learn the slope. This is due to the weak slope variation during the learning phase, as shown in figure 2.a (slope range in [2.4, 3.0]). For this specific problem, the initial slope setting therefore corresponds to the constant optimum slope of the problem, so that learning the slope does not really speed up the training phase. Even if the raw performance is not as high as that of the other algorithms, we observe a fairly good homogeneity of the training speeds, which may indicate better stability. All these remarks must be confirmed by an application to more complex and realistic problems.

III.2. Learning sea-bed natures according to texture analysis parameters

In order to validate our algorithm more thoroughly, we used a database of patterns derived from parameters that characterize the nature of the sea-bed through a texture analysis applied to sonar images. In practice, we extract 290 sub-images from data collected by sonar [4], each image representing only one kind of bottom among four classes: dunes, ripples, sand and stones (figure 3). Many methods allow us to discover structural information on the texture of these images [5]. Cooccurrence matrices are well known for capturing the spatial organization of pairs of pixels in a texture, according to a polar shifting vector (distance, angle). Thus, we calculate 9 matrices on each sub-image, each matrix corresponding to a value of the distance of the shifting vector (from 1 to 9) in four privileged directions [6]. To characterize the properties of a matrix, we compute 6 parameters such as homogeneity, entropy, correlation, ..., which allow quite good discrimination between textures.
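To make the feature extraction concrete, the sketch below builds one cooccurrence matrix and derives two of the parameters named above. It rests on assumptions that the text does not state: the shifting vector is given as a pixel displacement (dx, dy), so distance d at angle 0° corresponds to (d, 0); entropy is taken as −Σ p·log p; and homogeneity is taken as Σ p(i,j) / (1 + (i − j)²). The function names and the toy image are hypothetical.

```python
import math


def cooccurrence(image, dx, dy, levels):
    """Normalised counts of grey-level pairs separated by the displacement (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1.0
                total += 1
    if total == 0:
        return counts
    return [[v / total for v in row] for row in counts]


def entropy(p):
    """Assumed definition: -sum p * log(p) over non-zero entries."""
    return -sum(v * math.log(v) for row in p for v in row if v > 0)


def homogeneity(p):
    """Assumed definition: sum of p(i, j) / (1 + (i - j)^2)."""
    return sum(v / (1.0 + (i - j) ** 2)
               for i, row in enumerate(p) for j, v in enumerate(row))


# Toy 4x4 image with 4 grey levels, shifting vector of distance 1 at angle 0.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [2, 2, 3, 3],
         [2, 2, 3, 3]]
p = cooccurrence(image, dx=1, dy=0, levels=4)
features = (homogeneity(p), entropy(p))
```

Repeating this for 9 distances and keeping 6 parameters per matrix would give the 54-dimensional pattern used by the network, although the exact parameter definitions used in the experiments are not given here.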

Finally, the 3-layer network used for this application has 54 neurons in the input layer (a learning pattern is a 54-dimensional vector), 25 neurons in the hidden layer and 4 in the output layer (4 classes must be learned). The training is carried out with 10 patterns of each class (40 patterns in all), randomly chosen from the database.

Unfortunately, we were not able to obtain any results for the quickprop algorithm because its state did not satisfy the end criterion after 200 epochs, despite several runs for each parameter combination. The optimum slope for the Basic Rumelhart Algorithm is again different from 1.0, which emphasizes the usefulness of such a parameter. The major interest of this application comes from the wide evolution of the slope values during the training phase (between 0.35 and 1.5). Figure 2.b shows a training phase where slopes were initialized to 1.8 in order to show the impact of such learning (see the fall of the global error when the slopes begin to be learnt).

IV. CONCLUSION

The stability of this algorithm over a series of training runs (low standard deviation) is an interesting characteristic provided by slope learning, and indicates its better adaptability to different pattern distributions.

Figure 2.b clearly shows that each neuron requires a different slope. Learning these slopes allows us to set only two parameters (learning rate and momentum) instead of one initial slope value per last-layer neuron. The benefit is particularly appreciable when the number of output classes increases. We also observe that slope learning does not need a very fine tuning of its two parameters.

By optimizing training speed, this algorithm offers another approach to a more adaptive learning strategy.

VI. FIGURES

K hidden layers (layers 0, 1, ..., K, K+1).

N_k : Number of neurons in layer k
n_{k,i} : Neuron i of layer k
s_i : Wished output for n_{K+1,i}
w_ij^(k,k+1) : Weight between neurons n_{k,i} and n_{k+1,j}

Figure 1: Notations for algorithms

Figure 2: Last layer neuron slopes & global error evolution during training phase

a.: Encoder/decoder problem

b.: Sea-bed texture parameters

Figure 3: Sea-bed sonar sub-images (dunes, stones, sand, ripples)
Abstract

Gradient based techniques are suited to optimisation problems but existing techniques using gradient descent surfaces for neural networks are slow and do not scale up well as the problem size increases.

We argue that in the case where descent surfaces are used, slowness arises due to the weight transitions having weakly controlled directional properties. We present here an approach for feedforward networks which does not use descent surfaces. Instead, tangent hyperplanes are used with subgoals to provide strongly controlled directions for weight transitions. The extra control comes from best-fit approximations to a set of local solution manifolds which are computed directly using a linear solution system. The technique is fully automated with no critical problem dependent parameters.

Results on the benchmarks of XOR and the 2-spirals problem show substantial improvements in feasibility, robustness, and training speed when compared to descent techniques such as back-propagation and conjugate gradient descent.


1. Introduction

Gradient descent techniques are commonly used to train feedforward networks. However, these techniques have been found to be slow and sometimes unreliable, especially for larger problems [1].

With these techniques, a goal weight state is typically viewed as a minimum of an error-weight surface. In order to be effective, gradient descent methods require their travel surfaces to be regular to various degrees in directions towards such states. For a steepest descent method such as standard back-propagation [2], the regularity takes the form of circular bowls or linear troughs. More sophisticated descent methods can rely on less regularity to make for a benign surface. When momentum is used with back-propagation for example, oscillatory directions may be suppressed [2], though only to a limited extent. Another technique, Conjugate Gradient Descent, assumes the travel surface is an approximation to a quadratic surface. While there are clear improvements in training speed for this technique [3], the improvements in general have not been substantial enough for the method to be seen as overcoming the problem of slow or infeasible training for large training sets.

We would argue that the reason for hostile surfaces occurring lies in the way the surfaces are generated for feedforward nets. For each I-O pattern there is a gradient vector pointing in the direction of steepest descent for the respective error-weight surface. These vectors are summed to produce an overall vector for an overall error-weight surface. Although we have found empirically that each component vector provides an accurate direction to a surface minimum, the vector sum is not nearly as precise.

In fact, there is no theoretical geometric basis for a vector sum to point at the goal. That is, the number and direction of the component vectors are arbitrary as a set in relation to what is required for their sum to point correctly. Consequently, we suppose that a weak point in gradient descent methods is the way the individual vectors are combined. We provide instead a method of combining individual gradient information that is not based on the vector sum and has a geometrically grounded ability to point towards desired goals.
2. Tangential Solution

In the introduction, a goal weight state was viewed as a minimum of an overall error-weight surface. Figure 1(a) provides an example of this view for a linear net with a single minimum. A different perspective will be taken here. For each individual I-O pattern there is a set of weight states that will have zero error. We will call such a set a solution manifold. This suggests another view of a goal weight state as a point which has least error in the sense of being closest to all the solution manifolds. If there is a common intersection of all the solution manifolds then the solution will be exact, i.e. the set of I-O patterns will have no error at the intersection. Figure 1(b) illustrates this for a linear net. Otherwise the solution will be inexact. The goal states from the manifold and surface views are the same if the error is measured as the distance away from the solution manifolds.


Figure 1.

Two views of a goal weight state for a linear 2-1 net (without bias) that has an exact solution.

a) The contours of an error-weight surface with the goal indicated by G.

b) The individual solution manifolds with the goal at their intersection.

The approach taken here has to deal with a set of non-linear rather than linear solution manifolds for non-linear feedforward networks with hidden units. This will be done by taking a linear approximation to the non-linear manifolds for sufficient iterations. We will restrict ourselves to providing the simplest general method for feedforward networks. A single hidden layer is sufficient for any I-O mapping [4] and we adopt this architecture.

A further restriction will be to confine the methodology to networks with a single output unit initially before considering multiple output units. Each solution manifold for this type of network has dimension (N-1) in an N-D weight space. The linear approximation to the manifold at some point on it is then the tangent hyperplane at that point. The set of such hyperplanes constitutes a linear approximation to the set of solution manifolds. A point which is closest to the set of hyperplanes in terms of the total distance away from the hyperplanes is therefore this set's approximation to a goal weight state. This approximation will be accurate if the initial weight state is close enough to the goal. The degree of closeness required is directly related to the degree of curvature of the solution manifolds. The more non-linear they are, the closer the goal has to be for a good approximation to be expected.

In order to guarantee a good approximation then, a desired state needs to be nearby. Yet the goal which is the solution to the user problem may only have states which are far away. Consequently, a nearby subgoal is attempted with the chain of subgoals leading to the final goal. In this way, a good initial approximation to the goal weight state is not needed by the method. A subgoal weight state is similar to the final goal weight state in that although it has a known desired output state, its position in weight space is unknown until it is achieved. It differs from the final goal in that it may be set to be closer to the current state than any goal weight state. The requirements for a subgoal are that it must be close enough to the present weight state in order to generate a good linear approximation to the subgoal and yet also far enough away to allow significant progress to be achieved towards the goal.


2.1 Subgoal aiming

As mentioned in the introduction, the gradient vector for an individual I-O pattern provides an accurate direction to a minimum in the pattern's surface. A point W_SGP on the solution manifold for a subgoal I-O pattern P is therefore found using line search in the direction of the individual I-O pattern's gradient vector from the current weight state W_C (Figure 2(a)). The aim is to use such points to approximate W_SG (Figure 3).


Figure 2.

In a) the subgoal contour or solution manifold found is thickened.

In b) a contour at half the distance relative to the existing subgoal contour is selected as being the next subgoal solution manifold.

The tangent hyperplane approximating the solution manifold at W_SGP is then determined. The set of such hyperplanes for all the patterns' solution manifolds constitutes a linear approximation to the latter. The linear equations corresponding to the set of hyperplanes may be solved using Singular Value Decomposition [5]. SVD provides a candidate solution weight state W_SVD in both the exact and the inexact cases (Figure 3). In the inexact case, SVD yields an optimal solution in Least Mean Square terms. That is, the solution has the Least Mean Square error where error is measured as the total distance away from the tangent hyperplanes.
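The following sketch illustrates the linear solution step, assuming each tangent hyperplane is written as n_P · w = b_P with a normal vector n_P and an offset b_P. It uses NumPy's SVD-based pseudo-inverse rather than the Numerical Recipes routines of [5], and minimises the sum of squared distances to the hyperplanes. The function name and the toy hyperplanes are hypothetical.

```python
import numpy as np


def closest_point_to_hyperplanes(normals, offsets):
    """Least-squares solution w of normals @ w = offsets, obtained through the SVD."""
    normals = np.asarray(normals, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    # Scale every row to unit length so each residual is a Euclidean distance.
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    a = normals / lengths
    b = offsets / lengths[:, 0]
    # np.linalg.pinv computes the pseudo-inverse from the SVD of `a`.
    return np.linalg.pinv(a) @ b


# Exact case: two hyperplanes (lines) in a 2-D weight space meeting at (1, 2).
w_exact = closest_point_to_hyperplanes([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0])

# Inexact case: the lines x = 1, y = 2 and x = 3 have no common point.
w_inexact = closest_point_to_hyperplanes(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [1.0, 2.0, 3.0])
```

In the inexact case the result lies midway between the two parallel lines, which is the compromise point described above for hyperplanes without a common intersection.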


Figure 3.
Subgoal aiming.

2.2 Subgoal testing, setting, and chaining

The weight state W_SVD has data associated with it that is used in testing whether the subgoal it is aimed at is close enough to W_C. Besides the weight state itself, there is the actual output produced by W_SVD in response to the training inputs. There is also the subgoal output being targeted. This subgoal output is the output produced by each W_SGP in response to the training inputs for each pattern P. The aim is to see if such output can be found in a single weight state nearby, with W_SVD being the candidate. The data is tested using three heuristic criteria. The first two criteria need to be satisfied for W_SG to be deemed to be close enough. We have:

(1) 

where w_i is the i-th component of a weight state, and L_W is a constant real value, and also

$$\forall P : \; | O_{CP} - O_{SGP} | \le L_O \qquad (2)$$

where O_CP and O_SGP are the current and subgoal output values respectively for a pattern P, and L_O is a constant real value.

If the first two criteria are satisfied, the third criterion tests the closeness achieved by a candidate weight transition for progress. There are two alternatives for satisfaction.

^  y.(^rp(f "i* 0 ~f^sgp(0)  (3) 

P  P 

$$\sum_P \left( O_{CP}(t+1) - O_{SGP}(t+1) \right)^2 < M \qquad (4)$$

where the times t and t+1 refer to the beginning and end of the weight transition respectively, and M is a small constant set a priori by the user.

The first alternative, (3), is a test for progress achieved towards the goal. If this alternative is satisfied, the candidate subgoal is acceptable and is set as the subgoal. When progress has not been achieved, the candidate subgoal is failed. There then remains the decision as to whether to seek a new candidate for the existing subgoal. If the second alternative fails, this signals that there is still potential progress to be had and so a new candidate is sought. If the alternative is satisfied, the potential progress is too small to be worth pursuing further. The candidate is set as the subgoal, and a weight transition is triggered to move the process on and seek a new subgoal.

The first candidate subgoal for being set is the goal itself in case it is close enough. The subgoal aiming procedure described above is invoked and a candidate solution weight state attempting the candidate subgoal is found. This subgoal is tested using the data from the attempt on it, and the three heuristic criteria, for being close enough.

When the subgoal is not close enough and is failed, we halve the distances involved in the subgoal aiming. In the first instance, the distances between W_C and the solution manifolds are halved and the target output reset according to the outputs found at this distance (Figure 2(b)). This resetting provides a new candidate subgoal in output terms.

The distance bisection process is repeated until a candidate subgoal and solution weight state are found that satisfy the heuristic criteria. A subgoal is then deemed to have been set. Also, the solution state is taken to be the single attempt made on this subgoal. One iteration in our method has then been completed. A new subgoal is now set using the goal as a starting candidate again and the process repeated until the resultant subgoal chain converges the weight state to be sufficiently close to the goal in output terms.
In the description of the method above, each time a new candidate subgoal is aimed at during the setting of the next subgoal, we theoretically have to compute a completely new set of tangent hyperplanes to determine a new W_SVD. In practice though, we compute only one set of hyperplanes from scratch for this stage.

When the subgoal is close enough, there is not only a good match between the linear and non-linear solution but also a good degree of parallelism in the contours. Consequently, we may suppose the tangent hyperplanes at the candidate subgoal solution manifolds to be parallel in the direction of the gradient vector for an I-O pattern without loss. The hyperplane which is orthogonal to the gradient direction at W_C is computed once and then translated to each candidate subgoal solution manifold as required.

This simplification also gives us another major benefit. Since we set the subgoals for each pattern at the same fractional distance from their solution manifolds, all candidate subgoal attempts lie on the line connecting W_C and W_SVD (see Figure 3) where W_SVD is the state found by taking the goal as the candidate subgoal. Therefore we only need to use SVD once per iteration to compute the direction in which the subgoal attempts lie. The fractional distance being used then completes the determination of the position of each attempt.

3.  Experiments 

We present results comparing our technique with standard back-propagation (using momentum) for the common benchmarks of the XOR problem and the 2-spirals problem [1], [6]. We suggest that the results reflect an ability to find a good goal direction not present in gradient descent methods. The low number of iterations needed for each problem together with an insignificant failure rate indicate the robustness of the technique.

The 2-spirals problem represents a bridge between our version of the minimal problem of XOR and real world problems. In particular, it can be used to give an indication of how the method will scale up. We successfully attempted this problem with a fixed 2-50-1 architecture. This architecture is unsuited to deal with such a non-linear problem, at least as far as gradient descent is concerned. We could not find a solution weight state for this single layer architecture using standard back-propagation.

The same failure to train is reported by Baum & Lang. They were in fact unable to find a solution using either standard back-propagation or conjugate gradient methods even when they used a larger 2-60-1 architecture. Lang & Witbrock managed to solve the problem using a jumped 2-5-5-5-1 architecture, but also reported failure when training with architectures with fewer hidden layers.

These failures for the 2-spirals problem with conventional gradient descent methods and our zero failure rate lead us to the conclusion that our method scales up relatively better. We see the significance of our results in showing the method to be powerful in finding directions in networks not especially suited to deal with a problem.

The tolerance mentioned in the parameters for standard back-propagation represents the maximum acceptable difference between output and target for terminating training. The parameter settings for heuristics are shown to not be critically problem dependent here by choosing a common setting for both the XOR and the 2-spirals problems. We set L_W to 1.0, L_O to 0.1, and M to 0.001. The iterations indicate the number of direction changes made and are not otherwise comparable to those of standard back-propagation as computations.

Table 1. XOR Problem

Parameters

Trials: 1000; Output tolerance: 0.01; Targets: 0.8, 0.2; Initial weight range: [-1, +1]; momentum (for SBP): 0.9
Table 2. 2-Spirals Problem

  Method   Learning rate   Training iterations           Failure rate   Av. Real Time
                           average     std. deviation
  TP       -               445         209               0              253.69

Parameters

Trials: 10; Output tolerance: 0.33; Targets: 0.95, 0.05; Initial weight range: [-1, +1].


4. Multiple Output Units

The method described as the basis for the demonstrator experiments is suited to networks with single output units. In particular, it is based on the solution manifolds having a dimensionality (N-1) in an N-D weight space. The solution manifold for networks with multiple output units is (N-r)-D though, with r being the number of output units. The method may be extended to cope with this difficulty relatively straightforwardly, since such a solution manifold may be seen as the intersection of r (N-1)-D solution manifolds derived from the r combinations of an I/O pair and each output unit. Consequently, a tangent hyperplane may be computed for each of the r output units. These sets of hyperplanes constitute the linear approximation to the non-linear system of equations required by the approach.


5.  Conclusion 

A  geometrical  basis  for  finding  an  optimum  combination  of  gradient  vectors  has  been  given  using  tangent 
hyperplanes  and  subgoals.  The  method  is  seen  to  provide  strongly  controlled  directioning  resulting  in  a  lower 
number  of  direction  changes  during  training  and  lower  failure  rates.  The  method  has  also  indicated  a  good 
scaling  up  through  its  solutions  to  the  2-spirals  problem. 

The approach provides a training algorithm for feedforward networks with single hidden layers and hence is capable of providing any I-O mapping. Nevertheless, some problems may be better solved using networks having more than one hidden layer and this extension is currently under further investigation.


6.  References 

[1] - Baum, E.B., Lang, K.J. Constructing Hidden Units using Examples and Queries. In "Neural Information Processing Systems 3 1990", pp 904-910, 1991.

[2] - Rumelhart, D.E., McClelland, J.L. and the PDP Research Group. Parallel Distributed Processing, Vol. 1. The MIT Press, Cambridge, 1986.

[3] - Johansson, E.M., Dowla, F.U., Goodman, D.M. Back-propagation Learning for Multilayer Feed-Forward Neural Networks using the Conjugate Gradient Method. International Journal of Neural Systems, Vol. 2, No. 4, pp 291-301, 1992.

[4] - Hornik, K., Stinchcombe, M. and White, H. Multilayer Networks are Universal Approximators. Neural Networks, 2, pp 359-366, 1989.

[5] - Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. Numerical Recipes - The Art of Scientific Computing, 2nd ed. Cambridge University Press, 1992.

[6] - Lang, K.J. and Witbrock, M.J. Learning to Tell Two Spirals Apart. Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, pp 52-59, 1988.




Soft-Monotonic Error Functions


Martin Møller and Jan Depenau

Terma Elektronik A/S, Hovmarken 4, DK-8520 Lystrup, Denmark

and

Computer Science Department, Århus University,

Ny Munkegade, Build. 540, DK-8000 Århus C, Denmark


Abstract 

It is well known that the least mean square error function and the entropy error function are Bayes optimal. Satisfying Bayes optimality criteria does not give any information about convergence properties, trajectories in weight space (e.g., if training often leads to local minima or flat regions in weight space), or generalisation ability when trained on smaller sets of data. The problem with these error functions is that they are not monotonic with respect to classification, i.e., minimisation of the error functions does not imply minimisation of misclassifications.

This paper proposes two new error functions that exhibit a form of soft-monotonicity, where the monotonic behavior is dependent on the values of certain parameters associated with the functions. Through several experiments, it is shown that these functions can improve convergence and generalization.


1  Introduction 

Error functions like least mean square and cross-entropy are known to be Bayes optimal in the sense that minimization with these functions produces solutions that approach the greatest lower bound on generalization error as the training set approaches infinity. But when the training set is small this approximation can be poor [Buntine 91], and when it is sparse it is necessary to impose constraints on the network solutions. From a Bayesian perspective, this is the same as choosing appropriate priors, which is strongly related to penalty terms or regularizers in the statistical literature.

The problem with the least mean square error function can be illustrated by the following simple figure [Hampshire 92]. Consider a network with two output units having output between 0 and 1. The outputs are mapped onto the x- and y-axis respectively. If the desired target pattern is (1 0) then all outputs to the right of the line y = x can be considered correct. If and only if the contours of equal error are straight lines parallel to y = x, then there exist no regions with misclassification and lower error than other regions with correct classification. Hampshire defines error functions that satisfy such a condition to be monotonic. Hampshire strongly suggests that non-monotonic behavior in training can be the cause for the often seen "overlearning", i.e., where the recognition performance on a disjoint test set peaks and then degrades, while training set performance continues to improve.
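A small numerical check makes the non-monotonicity concrete. The two output points below are illustrative choices and are not taken from [Hampshire 92]; the error is the plain sum of squared differences to the target (1 0), and an output is counted as correct when the first unit exceeds the second.

```python
def squared_error(output, target):
    """Sum of squared differences between network output and target."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def is_correct(output):
    """For target (1 0): correct when the output lies to the right of the line y = x."""
    return output[0] > output[1]


target = (1.0, 0.0)
misclassified = (0.45, 0.55)   # just on the wrong side of y = x
correct = (0.1, 0.0)           # correct side, but far from the target

err_wrong = squared_error(misclassified, target)
err_right = squared_error(correct, target)
```

The misclassified point has the lower squared error of the two, so descending this error function does not necessarily reduce the number of misclassifications.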

The problem with suboptimal solutions exists in the form of local minima, which in practice often are very flat regions in error space. Suboptimal solutions in flat regions are often characterized by having a few patterns classified very wrongly and many correctly. The regions are flat because the network gradients are small for extremely wrong outputs. Minimization of the least mean square error function might very well converge to such regions because the training algorithms are greedy algorithms, updating weights in the direction of fastest error decrease, and no mechanism in the error function prevents the update of weights into these regions.

Figure 1: Illustration of non-monotonicity. The x-axis is the output from the first unit and the y-axis is the output from the second. The curves correspond to regions with equal least mean square error on target pattern (1 0). Clearly, there are regions where the network misclassifies, but where the error is lower than in other regions where the network classifies correctly. For example, the error in the point x1 is lower than the error in x2. For a monotonic error function, the contours would have to be straight lines.

This paper defines two new error functions that satisfy a soft-monotonic condition in the sense
that the functions are asymptotically monotonic in the limit for certain parameters associated with
the functions.

2  Imposing  constraints  on  network  solutions 

Instead  of  insisting  on  strict  monotonicity,  we  can  define  error  functions  that  satisfy  a  soft- 
monotonic  condition,  where  a  certain  parameter  controls  the  degree  of  monotonicity.  The  main 
idea is to incorporate appropriate constraints into the error function, so that the weights are
constrained away from bad regions in weight space.

A  way  to  avoid  suboptimal  solutions  is  to  strictly  minimize  the  number  of  misclassifications. 
Hampshire  defines  such  an  approach  that  works  for  binary  classification  problems  [Hampshire  92]. 
We  present  a  more  general  approach  that  involves  a  soft  minimization  of  misclassifications. 

Since good solutions are characterized not only by low average error but also by having as many
patterns  with  low  error  as  possible,  a  good  idea  would  be  to  include  both  terms  in  the  error  function. 
One  possible  approach  is  to  define  an  error  function  that  penalizes  errors  of  large  magnitude. 

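$$E(w) = \frac{1}{2} \sum_{p,j} e^{\alpha (t_{pj} - o_{pj} + \beta)(t_{pj} - o_{pj} - \beta)} \qquad (1)$$

where α and β are positive parameters. The derivative of (1) with respect to a given o_pj is

$$\frac{\partial E}{\partial o_{pj}} = -\alpha (t_{pj} - o_{pj})\, e^{\alpha (t_{pj} - o_{pj} + \beta)(t_{pj} - o_{pj} - \beta)} \qquad (2)$$

Figure 2: The function of the α and β parameters.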


It is easy to see that the global minimum for (1) is when t_pj = o_pj. The function of α and β
is illustrated in figure 2. β defines the width of the acceptable error around the desired target
and α controls the steepness of the exponentially growing error in the penalized regions outside the
interval. If α is small, equation (2) resembles the derivative of the least square function. But the
higher α gets, the more active is the constraint imposed on the penalized regions. When no errors
are in the penalized regions, β is decreased, so that the outputs are pulled towards the targets. A
high α value gives large partial error derivatives inside the penalized regions and small partial error
derivatives outside the regions. So the higher the α value, the more the errors will tend to
arrange themselves inside the region around the target. This gives a sort of balanced distribution
of the errors. For regression problems it is well known in statistics that a balanced set of errors can
yield better generalization; this is often referred to as variance heterogeneity [Seber and Wild 89].
It is an open question whether this is true also for classification problems.
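
The roles of α and β can be made concrete with a small numerical sketch. It assumes that the exponential error has the form E = 1/2 Σ exp(α((t - o)² - β²)), which matches the description above (minimum at t = o, acceptable band of half-width β, steepness controlled by α); the function names and the parameter values are illustrative only.

```python
import numpy as np


def exponential_error(targets, outputs, alpha, beta):
    """Assumed form: 0.5 * sum exp(alpha * ((t - o)^2 - beta^2))."""
    u = np.asarray(targets, dtype=float) - np.asarray(outputs, dtype=float)
    return 0.5 * np.sum(np.exp(alpha * (u + beta) * (u - beta)))


def exponential_error_grad(targets, outputs, alpha, beta):
    """Partial derivatives of the assumed error with respect to each output."""
    u = np.asarray(targets, dtype=float) - np.asarray(outputs, dtype=float)
    return -alpha * u * np.exp(alpha * (u + beta) * (u - beta))


# One output inside the acceptable band (|t - o| = 0.1 < beta) and one in
# the penalized region (|t - o| = 0.6 > beta).
t = np.array([1.0, 1.0])
o = np.array([0.9, 0.4])
for alpha in (0.1, 10.0):
    print(alpha, exponential_error_grad(t, o, alpha, beta=0.3))
```

For the small α the two derivatives are roughly proportional to the errors, as with least squares; for the large α the derivative in the penalized region dominates by orders of magnitude.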

In the limit when α increases to infinity, the exponential error function is monotonic. Surely,
for a fixed number of patterns in the training set, we can select a large enough α so that the
error function is monotonic. The problem is how large α should be to ensure monotonicity in a
given problem. Selecting too high an α slows down the convergence, because of too hard constraints
imposed on the acceptable paths down to the minimum. On the other hand, too small an α results
in non-monotonic behavior of the error function. One promising approach would be to adapt α
similarly to the penalty parameters in constrained optimization, starting with a small α and then
successively increasing α during training. This approach has not been tried yet. It seems that just
setting α to a “reasonable” size yields good results.

Notice that the exponential error function also works for non-classification problems and that
the soft-monotonicity condition can be obtained for any accuracy required by adjustment of the β
parameter.

A  more  direct  way  of  balancing  errors  is  to  minimize  the  variance  of  the  magnitude  of  the  errors. 
This  can  be  done  by  adding  the  variance  as  a  penalty  term  to  an  existing  error  function  like  least 
square. 

®(®)  =  -  Opif  -  rSrEf  -  »,.)")’  (3) 

i  p  i  p 

where η is a positive penalty parameter, N the number of output units and P the number of
patterns. The derivative of (3) is

-  Opp)’  -  ra-Ef' E,'’(V  -  m 

From  (4)  we  observe,  that  while  the  exponential  error  function  can  be  used  in  both  online  and 
offline  training  mode,  the  minimum  variance  error  function  can  only  be  applied  in  offline  mode. 
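
Why the variance penalty needs offline training can be seen from a small sketch. The exact form used here is an assumption made only for illustration: least squares plus η times the variance of the squared errors over all patterns and output units. The point carried by the code is that the derivative for a single output involves the mean error over the whole training set, which is not available when patterns are presented one at a time.

```python
import numpy as np


def variance_penalized_error(targets, outputs, eta):
    """Assumed form: 0.5 * sum(u^2) + eta * var(u^2), with u = t - o taken
    over all patterns and output units."""
    u = np.asarray(targets, dtype=float) - np.asarray(outputs, dtype=float)
    sq = u ** 2
    return 0.5 * np.sum(sq) + eta * np.var(sq)


def variance_penalized_grad(targets, outputs, eta):
    """Derivative of the assumed error with respect to every output.

    The term sq.mean() couples each output to all other patterns, which is
    why the whole training set must be seen before a weight update.
    """
    u = np.asarray(targets, dtype=float) - np.asarray(outputs, dtype=float)
    sq = u ** 2
    return -u - eta * (4.0 / sq.size) * (sq - sq.mean()) * u


t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # 3 patterns, 2 units
o = np.array([[0.8, 0.3], [0.4, 0.5], [0.1, 0.2]])
print(variance_penalized_grad(t, o, eta=2.0))
```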

3  Experiments 

In  this  section,  we  compare  the  least  mean  square  error  function,  the  exponential  error  function 
and  the  minimum  variance  error  function. 

To see how the different error functions compare on problems with varying input
dimensions, some artificial data were generated. For dimension N a set of 4N centerpoints, each an
N-bit string, was randomly chosen. Around each centerpoint a set of 9 distortions was generated
using a Gaussian distribution to determine whether to flip a bit or not. This gives a total of 40N
patterns. Each centerpoint and its distortions were then randomly assigned to one out of two
possible classes.
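
A data set of this kind can be sketched as follows. The text does not say how the Gaussian draw decides a bit flip, so the rule used here (flip when the magnitude of a standard normal sample exceeds a threshold) and the names `make_bit_patterns` and `flip_threshold` are assumptions made for illustration.

```python
import numpy as np


def make_bit_patterns(n_bits, n_distortions=9, flip_threshold=1.5, seed=0):
    """Generate 4N random N-bit centerpoints, each with its distortions.

    Returns (patterns, labels); every centerpoint shares one randomly
    chosen binary class with all of its distortions.
    """
    rng = np.random.default_rng(seed)
    n_centers = 4 * n_bits
    centers = rng.integers(0, 2, size=(n_centers, n_bits))
    center_labels = rng.integers(0, 2, size=n_centers)

    patterns, labels = [], []
    for center, label in zip(centers, center_labels):
        patterns.append(center)
        labels.append(label)
        for _ in range(n_distortions):
            # Assumed rule: flip a bit when |z| exceeds the threshold.
            flips = np.abs(rng.standard_normal(n_bits)) > flip_threshold
            patterns.append(np.where(flips, 1 - center, center))
            labels.append(label)
    return np.array(patterns), np.array(labels)


X, y = make_bit_patterns(8)
print(X.shape, y.shape)
```

For N = 8 this yields 32 centerpoints and 320 patterns in total, in agreement with the 40N count given above.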

It is widely recognized that the class of conjugate gradient algorithms is well suited for learning
because of its ability to gain second order information without too much calculation
work [Battiti 92]. One, the Scaled Conjugate Gradient algorithm [Møller 92a], has especially low
calculation costs, and has for that reason been used in the experiments to follow.

3.1  Training 

The three error functions were tested on dimensions 8, 10, 12, 14, 16 and 18, with 5 different runs on
each dimension, using a 3 layer network with N hidden units. Training was terminated only when
all patterns were classified correctly or a reasonable limit was reached. Table 1 summarizes the
average results obtained. α was set to 1. The initial β was set to 0.9 and then halved every time
no errors were inside the penalized regions. The penalty parameter η was set to 1-2.

In the runs with the least mean square error function only a few global solutions were found with
100% correct classification. Training on the other two error functions, however, yielded optimal
solutions in all runs. The exponential error function seems to give the fastest convergence, but this
might be because of the actual values of α, β and η.
       Least Mean Square           Exponential Error         Variance Error
       Epoch       Correct         Epoch       Correct       Epoch       Correct
Dim    μ     σ     μ      σ        μ     σ     μ     σ       μ     σ     μ     σ
 8     487   73    .984   .004     76    3     1     0       82    13    1     0
10     493   15    .983   .008     112   19    1     0       147   23    1     0
12     413   54    .988   .000     109   10    1     0       111   10    1     0
14     478   37    .993   .002     77    12    1     0       81    5     1     0
16     490   24    .994   .001     75    1     1     0       94    13    1     0
18     447   46    .996   .004     78    7     1     0       89    4     1     0

Table 1: Average results on artificial data, μ = mean and σ = standard deviation.

3.2  Generalization 

In this section we investigate the generalization ability of network solutions found by minimization
of the different error functions. Again some artificial data were generated, this time with continuous
input constrained between 0 and 1. We chose dimension 10 with 20 centerpoints, 50 distortions per
centerpoint and 4 possible output classes. The average overlap between the centerpoints was 4%,
meaning that 4% of the distortions were nearer to other centerpoints than to the one they were generated
from. The set of patterns was then split into a training set, a validation set and a test set of equal
size. When applying the k-nearest neighbor technique to the data we got a maximum performance of
94.26% on the validation set, giving 93.69% on the test set (k=5). Because of the way the data
were generated we would not expect the neural network solution to do much better than that. We
ran the following experiments. SCG was tested on the least square error function, the exponential
error function and the minimum variance error function. 5 different runs were made for each
test. When the classification rate on the validation set was at its highest, the number of iterations
run and the classification rate on the test set were recorded.
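
The recording rule used in these runs can be written down in a few lines. This is a sketch under the assumption that validation and test classification rates are logged once per iteration; the function name and the sample numbers are hypothetical and are not results from the experiments.

```python
def select_by_validation(val_rates, test_rates):
    """Return (iteration, test_rate) at the first validation peak.

    Iterations are counted from 1. The test rate is only read off at the
    selected iteration; it never influences the selection.
    """
    best = max(range(len(val_rates)), key=lambda i: val_rates[i])
    return best + 1, test_rates[best]


# Hypothetical per-iteration classification rates.
val_rates = [0.80, 0.94, 0.93, 0.94]
test_rates = [0.79, 0.93, 0.95, 0.92]
print(select_by_validation(val_rates, test_rates))
```

Ties are resolved in favour of the earliest iteration, which also gives the smallest iteration count.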

The results are illustrated in figure 3. We observe the same trend for both the exponential error
function and the minimum variance error function: the higher the α and η values, the better the
generalization. For η equal to 30 there is a decrease in generalization. At this point the constraint
towards low variance was too strong. Unfortunately, this gain in generalization comes at the
expense of the convergence rate, as the figure also shows. This is, however, not surprising since high
α and η values impose a tougher constraint on the acceptable path down to the minimum. The
minimum variance and the exponential error functions give approximately the same maximum
generalization performance as the k-nearest neighbor. At this maximum generalization point the
convergence rate of the minimum variance error function is slightly higher than the convergence rate
of the exponential error function.

4  Conclusion 

This paper has shown that imposing appropriate constraints on network solutions can improve
convergence and generalization. We have proposed two new error functions that impose such
constraints. We do not claim that these functions are in any way optimal, but we do believe that
our results illustrate the necessity of adding such constraints. Minimization with the new error
functions produces on average better solutions with respect to generalization than the least mean
square error function.

The quality of the solutions found with the new error functions depends heavily on the values
of the constraint parameters α and η. We have not addressed the problem of choosing optimal
values of α and η. Several heuristic methods could be applied, like starting with a small value and
then slowly increasing it. More sophisticated techniques, like the ones used to estimate appropriate
regularization parameters, might also be usable in this context.

Figure 3: Results on the test set using the exponential error function and the minimum variance
error function with different α and η values.

It would be interesting to know how the distribution of the errors on the training set influences the
generalization ability. Our results indicate that the more balanced the distribution is, i.e., the more
equal the errors are in magnitude, the better generalization one can expect. It remains for future
work to actually prove the relationship between expected generalization and error distribution.
In the past several years, feedforward neural networks have developed rapidly; in particular,
a number of papers have improved the Back Propagation algorithm. Generally speaking, however,
they use gradient descent techniques on the error hypersurface. In this
talk, we mainly discuss methods of digging tunnels into the error hypersurface.
Two digging methods are presented: one digs horizontally, the other digs
down into the error hypersurface. Both methods use the idea of structure variation. Since
multilayer perceptron (MLP) training intrinsically solves a nonlinear problem, and
MLPs are white boxes in which the weights and thresholds can be added, deleted and
renewed purposely, it is unnecessary to always use the traditional gradient descent
techniques.

Our idea is as follows: (1) An MLP is trained with a relatively small number of
hidden neurons using traditional gradient descent techniques until the algorithm is
trapped in a local minimum. (2) A tunnel is dug horizontally to move the MLP to another
isohypse position on the error hypersurface, and gradient descent techniques are used
again. This digging method can be called rotation transformation, in which a hidden
neuron is added by certain rules and the Perceptron Convergence Theorem is used;
then an original hidden neuron is removed according to the correlation of the outputs
in the hidden layer. Hence the number of hidden neurons will not increase. (3) Or a
tunnel is dug down into the error hypersurface. This digging method is also called
the compensation method, in which hidden neurons are also added. But whenever a
hidden neuron is added, its input and output weights and threshold are calculated
directly rather than iterated. Thus, it ensures global convergence.
Abstract - Hardware implementation of learning algorithms for recurrent neural networks is considered.
Recurrent back-propagation and diffusion optimization are modified so as to work on-line, and a circuit realization
scheme is proposed. A new error function is defined, specifically suitable for continuous input - discrete output
mapping tasks. Simulation results are presented for a simple cellular neural network problem.

1.  Introduction 

Supervised learning for Neural Networks (NNs) is a very complex optimization problem that is generally
solved by lengthy computations on digital (conventional or parallel) computers. For this reason, large computing
resources (time and/or power) are required, so that in many cases real-time problems are hardly accessible.

Therefore, it is desirable to realize fully parallel hardware implementations of NNs that include learning
in the same system, which would also allow for real-time on-line adaptability of the net.

Several authors have studied implementation of learning algorithms; however, only few results have been
published concerning supervised learning for feedforward networks. Back-Propagation (BP) was implemented by
H. Eguchi et al. (1991) by using pulse frequency encoding for signals. In this technique, multiplication and
addition/nonlinear squashing are performed by very simple circuitry (AND and OR gates, respectively).
Concerning fully analog implementations, M. Hasler (1993) proposed a continuous-time realization of BP by use
of a resistive circuit (the adjoint of the NN).

In this paper, the problem of hardware learning for recurrent NNs is addressed. The author is aware of
no feasible solution proposed to date. Application of Recurrent Back-Propagation (RBP) and of Diffusion
Optimization is considered. The main advantages of these algorithms, in view of hardware realization, are:
continuous time operation; calculation of weight corrections performed locally; no memory required (no batch
operation). Possible circuit implementations are proposed, and results of simulations performed on Cellular
Neural Networks are reported.

2.  Definitions  and  Notations 

2.1 Hopfield and Cellular Neural Networks

In order to fix notations, define a Hopfield Neural Network (HNN - Hopfield, 1984) as follows:

$$\tau_x \frac{dx_i}{dt} = -x_i + \sum_j W_{ij} y_j + u_i$$

$$y_i = f(x_i); \quad f(-x) = -f(x), \quad f'(x) \ge 0; \quad \lim_{x \to \pm\infty} f(x) = \pm 1$$

Vector x will be called the state of the network, y is the output, u is the input, or threshold, and matrix W is the weight
matrix. Function f is a sigmoidal, or squashing, function.
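
A network of this type can be simulated with a simple forward Euler scheme. The sketch below assumes state dynamics of the form τ dx/dt = -x + W y + u with y = f(x), and uses tanh as one squashing function that is odd, non-decreasing and saturates at ±1; the step size, the number of steps and the two-neuron weight matrix are illustrative choices.

```python
import numpy as np


def simulate_hnn(W, u, x0, tau=1.0, dt=0.01, steps=5000, f=np.tanh):
    """Integrate tau * dx/dt = -x + W f(x) + u with forward Euler.

    Returns the final state x and output y = f(x).
    """
    W = np.asarray(W, dtype=float)
    u = np.asarray(u, dtype=float)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        y = f(x)
        x += (dt / tau) * (-x + W @ y + u)
    return x, f(x)


# Two mutually excitatory neurons, no external input.
W = np.array([[0.0, 2.0], [2.0, 0.0]])
u = np.zeros(2)
x, y = simulate_hnn(W, u, x0=[0.1, 0.2])
print(x, y)
```

With symmetric weights the state settles at an equilibrium, where the right-hand side of the state equation vanishes.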

Define a Cellular Neural Network (CNN - Chua and Yang, 1988) as a HNN in which neurons are only
connected to neighbors, and weights are defined in a uniform way over the network. Without loss of generality, we
consider a CNN in which neurons are arranged on a planar square grid, and indexed by double indices. Neurons
whose indices do not differ by more than r are connected. Relevant equations are written as follows:


* On leave from: Dipartimento di Ingegneria Elettronica, University "La Sapienza" di Roma, via Eudossiana, 18 -
00184  Roma  Italy.  E-mail:  mb@tce.ing.uiurotnal.it 




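$$\tau_x \frac{dx_{ij}}{dt} = -x_{ij} + \sum_{kl \in N_r(ij)} W_{k-i,l-j}\, y_{kl} + \sum_{kl \in N_r(ij)} B_{k-i,l-j}\, u_{kl} + I$$

$$y_{ij} = f(x_{ij}); \quad f(x) = \frac{1}{2}\left(|x + 1| - |x - 1|\right)$$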

In this case, input vector u, weighted by control matrix B, is considered distinct from bias I. N_r(ij) is a
function yielding the set of indices of neighbors of neuron ij, which is a square of side (2r+1) centred on ij.

It is apparent that in this case there are only a small number of independent weights, which can be
arranged into (2r+1)×(2r+1) matrices W and B, plus a scalar bias I. If, on the contrary, weights are allowed to
vary independently over the network, the corresponding network will be called a General CNN (GCNN).
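
The cloning-template structure can be sketched in code. The following assumes the standard Chua-Yang form of the state equation, with (2r+1)×(2r+1) templates W and B, a scalar bias I and the piece-wise linear output f(x) = (|x+1| - |x-1|)/2; cells outside the grid are simply ignored, which is a boundary assumption made here for illustration.

```python
import numpy as np


def pwl_output(x):
    """Piece-wise linear output: identity on [-1, 1], saturated outside."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))


def neighborhood(i, j, r, rows, cols):
    """Indices of the (2r+1) x (2r+1) square centred on (i, j), clipped to
    the grid."""
    return [(k, l)
            for k in range(max(0, i - r), min(rows, i + r + 1))
            for l in range(max(0, j - r), min(cols, j + r + 1))]


def cnn_derivative(x, u, W, B, I, r, tau=1.0):
    """Time derivative of every state for templates W, B and bias I."""
    rows, cols = x.shape
    y = pwl_output(x)
    dx = np.empty_like(x, dtype=float)
    for i in range(rows):
        for j in range(cols):
            total = -x[i, j] + I
            for k, l in neighborhood(i, j, r, rows, cols):
                # Template entries are addressed by relative position.
                total += W[k - i + r, l - j + r] * y[k, l]
                total += B[k - i + r, l - j + r] * u[k, l]
            dx[i, j] = total / tau
    return dx


r = 1
W = np.zeros((3, 3)); W[1, 1] = 2.0   # self-feedback only
B = np.zeros((3, 3))
x = 0.5 * np.ones((5, 5))
print(cnn_derivative(x, np.zeros((5, 5)), W, B, I=0.0, r=r))
```

Whatever the grid size, only the 2(2r+1)² + 1 template parameters are free, which is the small number of independent weights mentioned above.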

2.2  Learning  problem 

We shall consider use of HNNs and CNNs as mappers from the continuous space $X^0 \times U$ of initial states
and inputs into the space $Y^\infty = ([-1, -1+\varepsilon] \cup [1-\varepsilon, 1])^N$ (where N is the number of neurons) of saturated
outputs; ε is a small positive number, and when f is piece-wise linear, $Y^\infty = \{-1, 1\}^N$. Define
$X^\infty = ((-\infty, -x_{sat}] \cup [x_{sat}, +\infty))^N$, where $x_{sat}$ is a positive number, so that $y \in Y^\infty$ when $x \in X^\infty$ (for
CNNs, 

Under suitable conditions on weights and derivative of f (Hopfield, 1984; Chua and Yang, 1988), the
network is asymptotically stable, and equilibria belong to $Y^\infty$. We shall assume that such conditions are enforced.

The learning task considered consists of realizing a given mapping $\mathcal{M}: X^0 \times U \to Y^\infty$, given a set of
learning examples (training set) $\mathcal{A} = \{x^{0\mu}, u^{\mu}, d^{\mu};\ \mu = 1, 2, \ldots, M\} \subset X^0 \times U \times \{-1, 1\}^N$ (i.e. a set of triplets
formed by initial state, input, and desired output). We are not addressing here the problem of generalization;
however, we note that in the case of CNNs a single learning example (M=1) may be enough to define a task,
because, due to the space-invariant property of the cloning template, it is in some sense equivalent to as many
independent examples as there are neighborhood-sized subsets contained in it.

The  training  set  is  the  only  information  given  to  the  algorithm,  besides  network  topology.  In  fact, 
external  control  and  data  communication  should  be  minimized,  in  order  to  exploit  the  full  speed  of  parallel  analog 
computation.  For  the  same  reason,  emphasis  is  put  on  simplicity  of  realization,  rather  than  on  computing  time;  all 
calculations  are  to  be  done  locally,  and  if  memory  is  required,  it  should  also  be  local,  and  preferably  analog. 

2.3 Recurrent Back-Propagation and Diffusion Learning

RBP, defined by F. Pineda (1987), is analogous to BP for recurrent networks. Unlike BP, it is defined in
continuous time, and does not need separate forward- and back-propagation phases, which simplifies circuit
timing issues. It can be realized by adding to the network an adjoint net that has the same topological properties.
RBP changes weights dynamically by making their time derivatives proportional to the opposite of the derivative
of the error function with respect to the weight considered:

$$\frac{dW_{ij}}{dt} = -\frac{\partial E}{\partial W_{ij}}$$

Relevant  equations  for  RBP  for  HNNs  are  as  follows  (Pineda,  1987;  Balsi,  1993); 


The first equation describes weight dynamics, while the second represents the adjoint BP network, which
appears to be a resistive net, with the same connection topology as the forward net, except for direction reversal of
connections, as seen from weight matrix transposition. Symbol denotes equilibrium state reached by applying
initial state and input of example μ. E is the error function, to be defined below.
Diffusion optimization (Geman and Hwang, 1986) can be seen as a gradient descent method with added
noise, which decreases slowly, so that parameters converge in probability to global minimizers of the error
function. Continuous annealing (a discrete-time version of diffusion, or, from another point of view, a
continuous-space version of simulated annealing) has been used by a few authors (Hoptroff and Hall, 1989) for NN learning;
Wong (1991) used diffusion learning for stochastic HNNs both as a means of obtaining global convergence during
operation and for learning, also hinting at hardware realization. His method, however, is not immediately
translated to deterministic networks, such as those considered here.

Diffusion learning can be defined for the HNN as follows:

$$\frac{dW_{ij}}{dt} = -\frac{\partial E}{\partial W_{ij}} + T w$$


where w(t) is a white noise, which is (in a weak sense) the time derivative of a Wiener process (Pugachev and
Sinitsyn, 1987). It is apparent that the diffusion algorithm may be obtained by adding a noise term to RBP
learning equations.
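
The relation between gradient descent and diffusion can be illustrated in discrete time. The sketch below is an Euler-Maruyama discretization of a gradient flow with an added noise term of amplitude T; the cooling schedule T0 / (1 + decay * k) and all names are hypothetical choices for illustration and do not reproduce the schedule analysed by Geman and Hwang. For T0 = 0 the update reduces to plain gradient descent, in the same way that RBP is recovered from diffusion learning.

```python
import numpy as np


def diffusion_descent(grad, w0, T0=0.0, dt=0.01, steps=20000, decay=1e-3, seed=0):
    """Noisy gradient flow dw = -grad(w) dt + T dB with slowly decreasing T."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(steps):
        T = T0 / (1.0 + decay * k)  # assumed cooling schedule
        # Increment of a Wiener process over one time step.
        dB = rng.standard_normal(w.shape) * np.sqrt(dt)
        w += -grad(w) * dt + T * dB
    return w


def quadratic_grad(w):
    """Gradient of the toy error E(w) = 0.5 * |w - 1|^2."""
    return w - 1.0


print(diffusion_descent(quadratic_grad, w0=[3.0, -2.0], T0=0.0))
print(diffusion_descent(quadratic_grad, w0=[3.0, -2.0], T0=0.5))
```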


3. Error Function - Realization Issues - Modified Equations


For the learning problem stated in section 2.2, we chose to use a new "tailor-cut" error function. In fact,
when a traditional output-based function is used, shallow error surfaces arise, especially when the output function is
piece-wise linear, as is the case with CNNs. However, using a state-based function, while effectively improving
from this point of view (Schuler, 1993), adds unwanted additional constraints to the output, which may even
prevent a working solution from being reached.

In order to have the advantages of both approaches, without their disadvantages, the new error function is
defined as follows:

pi  i  i  ^  ^ 

Error is written as a sum of errors over individual neurons (indexed by i) and examples (indexed by μ).
δ^(-3) is the third integral of the Dirac pulse, i.e.:

$$\delta^{(-3)}(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2} x^2 & \text{else} \end{cases}$$
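
As a small check of this definition, the third integral of the Dirac pulse and its derivative can be written directly. Integrating the pulse once gives the unit step, twice gives the ramp, and three times gives the half-parabola used here; the function names are chosen for illustration.

```python
def delta_minus3(x):
    """Third integral of the Dirac pulse: 0 for x < 0, x^2 / 2 otherwise."""
    return 0.5 * x * x if x > 0 else 0.0


def delta_minus2(x):
    """Second integral of the Dirac pulse (ramp), the derivative of
    delta_minus3; it is piece-wise linear."""
    return x if x > 0 else 0.0


for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, delta_minus3(x), delta_minus2(x))
```

The derivative of the quadratic piece is a ramp, so an error built from this function has piece-wise linear derivatives.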


1 + ρ_min ≥ x_sat is the minimum acceptable magnitude of equilibrium state value, while 1 + ρ_max is its maximum. The
second term, involving maximum state, may be omitted, but it has the advantage of preventing weights from
drifting towards bigger and bigger values, especially when the algorithm is implemented in such a way as to be
sensitive to the time integral of error, which is reduced when outputs are saturated earlier.

The quadratic form has the advantage of causing error derivatives to be piece-wise linear, simplifying
implementation. In fact, the relevant function to be implemented is the following:


A generic error component E_i^μ (solid line) and function f_E are plotted in figure 1 for C = 1.

Figure 1 - error function and inverted derivative

Learning equations, as written in section 2.3, imply batch processing of examples, which would mean
memorizing corrections and separating forward and backward propagation phases. Simplified operation is
obtained by exploiting integration over time, while presenting examples in a (deterministic or stochastic)
succession, keeping each example steady for a prescribed period.

At the same time, instead of using equilibrium states, transient states are used. The period of example
presentation should be long enough for the network to relax, and to stay steady at equilibrium for a time long
enough to make negligible the wrong contribution to weight correction due to the transient. In this functioning mode, the
algorithm is actually sensitive to the integral of error over time, so that faster solutions are preferred. With a little
addition to the system, the effect of the transient may be masked by interrupting learning for a suitable time after a new
example is presented. This may be obtained by fixing error signals at zero.

With these considerations in mind, learning equations will be written as follows for diffusion (RBP is
obtained in the same form for T=0):


J 

The following equations define diffusion learning (and, for T=0, RBP) for CNNs. In this case, parameters
to be learned are not only state weights, but also input (control) weights and bias.

^  ZWy*+/;/+ j  +  Tw 

~  5^  ^i-kj-lf  (■*(/ ~  •  ^ij  ) 

kleNrdj) 

=  Z  /fiiCy  .-ty  )uk+i  j+ j  +  Tw 
kl 

=  fE(<>ij’Xij )  +  Tw 
at  jj 

4.  Proposed  Hardware  Realization 

In figure 2, realization of a HNN with hardware learning is presented schematically, by making use of
transconductance amplifiers. One neuron is considered, with only one connection represented. The output function is
here supposed to be piece-wise linear, so that a controlled switch is sufficient to realize multiplication by its
derivative, whose only possible values are 0 and 1 (figure 3(b)). In the weight adaptation section, a controlled source
is driven by resistor noise, to produce the stochastic term for the diffusion algorithm. This is obviously just a way
of representing the necessary operation, while its actual realization should be considered at a stage where
technology is chosen.


Figure 2 - hardware implementation: (a) neuron dynamics; (b) gradient evaluation; (c) weight adaptation


5.  Simulation  Results 


Simulations were performed on CNNs. In a previous paper (Balsi, 1993), I reported results concerning
application of RBP to GCNNs. In that case, individual weights are not subject to as many constraints as for CNNs.
For the latter, in fact, global functioning is only governed by a few parameters, forming the cloning template. This
is reflected in learning equations, where a global connection pattern arises (summations over all neurons are
present), that is not present in the case of GCNNs or HNNs, where learning equations are local.

When applied to CNN learning, RBP proved very prone to getting stuck in local minima, so that learning
was successful only when starting weights were very close to a valid solution. In fact, correct functioning was
obtained only when starting weights guaranteed correct sign of equilibrium points.

For this reason, I tried using random presentation of examples as a way of climbing out of the said local
minima, as proposed by many authors for BP (Heskes et al., 1992). In this way, the CNN can correctly learn what
is generally called the "noise filtering" cloning template. This functionality consists of bringing to negative final
output all those neurons that have positive initial state, but are surrounded by negative output neurons, while
leaving everything else as it is. The name is due to the fact that, by associating light intensity of pixels of an image
with the outputs of neurons of a planar CNN, isolated lighted pixels are removed, and the image smoothed.


Figure 3: (a) weight evolution - RBP; (b) maximum error - RBP

Figure 3 shows evolution of weight values (a) and maximum error (b) as a function of time, measured in neuron time
constants τ_x, obtained in a simulation of a one-dimensional 5-neuron CNN, performed by integrating equations
with a modified Runge-Kutta-Merson algorithm, with 32 examples. Simulation was interrupted after error had
stayed at zero for a while; weights would have actually settled at larger values because of parameters chosen (large
ρ_max); representing them would make the figure less readable.


As said above, a CNN task may also be specified by a single example, which greatly simplifies control and
communication. As in this case random motion cannot be obtained from the data, it is necessary to use
diffusion learning, which provides built-in, controlled stochasticity. In fact, by using the diffusion algorithm, an
11-neuron CNN was taught the same noise filtering task with a single example. Figure 4(a) shows weight
evolution during learning, while figure 4(b) shows learning error E. The computed cloning template works correctly
in all cases.

In  both  cases,  it  is  apparent  that  the  role  of  gradient  force  is  keeping  steady  those  correct  weight  patterns 
that  are  actually  found  by  random  motion. 

6.  Conclusions,  Open  Problems,  Perspectives. 

Supervised  learning  algorithms  for  recurrent  networks  were  adapted  for  hardware  realizability,  and  tested 
by  simulation.  The  case  presented  is  very  simple,  because  of  complexity  of  simulation;  however,  it  proves 
feasibility  of  the  methods  presented.  In  fact,  results  were  obtained  under  realistic  constraints:  in  particular,  limited 
range  and  bandwidth  of  electronic  circuits.  These  preliminary  results,  therefore,  encourage  further  research 
towards  realization  of  a  completely  analog  network,  capable  of  real-time  on-line  learning. 

One  of  the  main  open  problems  is  caused  by  limitation  of  weight  and  state  values,  due  to  supply  voltage 
constraints,  that,  in  some  cases,  causes  the  algorithm  to  get  stuck,  for  a  possibly  long  time,  on  wrong  solutions. 
Methods  to  avoid  such  failures  should  still  be  investigated.  Further  investigation  should  also  explore  practical 
circuit  implementation.  In  relation  to  this  issue,  some  aspects  of  the  algorithms  (e.g.  pattern  presentation  schemes, 
noise  exploitation)  might  be  adapted  to  physical  constraints. 

A  particular  issue  concerns  CNNs.  In  fact,  as  noted  above,  cloning  template  learning  involves  global 
evaluation  of  the  problem  being  solved,  while  the  network  only  has  local  information  flowing.  This  issue  poses 
serious  difficulties  to  learning  in  cases  characterized  by  diffusion  of  information  over  the  whole  network.  Solution 
to  such  problems  goes  beyond  the  scope  of  this  paper. 

Continuation  of  the  work  will  aim  at  designing  a  practical  system.  The  purpose  is  twofold:  making  a 
complete  adaptive  neural  machine,  to  be  applied  to  real-time  problems,  and  realizing  a  learning  system  to  be  used 
in  development  of  special-purpose  networks.  This  last  case  might  be  of  interest  in  particular  in  the  case  of  CNNs, 
where uniformity of the system makes solutions found on small nets immediately scalable to large problems.

Abstract

Numerous algorithms have been proposed to speed up the learning time of backpropagation.
However, most of them do not take into consideration the amount of hardware required to implement the algorithm.
Without suitable hardware implementation, the real promise of neural network applications will be difficult to
achieve. There is a need for special purpose hardware, particularly specialized integrated circuits, to serve in high
performance real-time applications. This paper proposes an adapted backpropagation algorithm to be judged by the
measures of speed and area when it is implemented in digital VLSI. Since multiplication dominates computation and is
expensive in hardware, the approach is to reduce the number of multiplies in the backward path of the
backpropagation algorithm by setting some neuron errors to zero. This paper proves convergence
by means of the general Robbins-Monro process, a stochastic approximation process. The proof is valid if neuron errors are set to
zero randomly and the learning rate decreases with time. However, setting the neuron errors to zero randomly is slow
compared to the standard algorithm. This paper therefore argues that neuron errors should be set to zero according to
their magnitudes. The theory is confirmed with simulation results for a character recognition problem, where only
the error is minimized, and a function approximation problem, where testing patterns monitor generalization performance.
Finally, hardware implementation is discussed and an area comparison is shown. The conclusion is that the reduced
operation algorithm is superior to the standard backpropagation algorithm in terms of speed and area.

Introduction 


Figure 1: Data flow diagram of backpropagation (panel labels: Layer l-1, l and Layer L-1, L)

A data flow diagram of the backpropagation algorithm [1] is shown in figure 1. It was first shown in [2]. In
the diagram, data are expressed as vectors and matrices so that all operations can be written as vector-vector and
matrix-vector products. $\mathrm{sum}_p^{(l)}$ and $y_p^{(l)}$ are vectors whose components $\mathrm{sum}_{pj}^{(l)}$ and $y_{pj}^{(l)}$ are the input and the output of
neuron j in layer l with pattern p, respectively. $d_p$ is the desired output vector with a similar definition. $w^{(l)}$ and $\Delta w^{(l)}$
are the weight matrix and the change-of-weight matrix, where each component $w_{ji}^{(l)}$ and $\Delta w_{ji}^{(l)}$ is the weight and the change of weight of the
connection from neuron i in layer l-1 to neuron j in layer l, respectively. $e_p^{(l)}$ is a neuron error vector where each
component $e_{pj}^{(l)}$ is the error term associated with neuron j in layer l with pattern p. The matrix-vector multiplication is
represented by •. KP is a Kronecker product or a term by term multiplication of two equal length vectors. OP is an
outer product of a column and a row vector to expand a matrix. f is a sigmoid function and f' is its derivative. T
represents a transpose and r is the learning rate.

The data flow diagram can be used to estimate the number of multiplications required, which is a good
indication of the complexity of the hardware needed to implement the algorithm because multiplication is expensive
and dominates the whole computation. For simplicity, assume that all layers have the same number of neurons, N.
In the forward path, between each pair of layers, there are $N^2$ multiplies represented by •. In the backward path, • takes $N^2$
multiplies, the same as the forward path. KP needs N multiplies and OP needs $2N^2$ multiplies (including the
multiplies by the learning rate, r, and assuming the weights are updated after each pattern). Except for the last layer,
which has $2N^2 + N$ multiplies, there are $3N^2 + N$ multiplies for the backward computation between layers l-1 and l.
Consequently, the number of multiplications in the backward path is about three times that in the forward path.

Since the backward path requires considerably more multiplications, we propose an approach to reduce
them. In figure 1, if we set some elements of $e_p^{(l)}$ in each layer to zero, the number of multiplies is reduced. Since
$e_p^{(l)}$ is computed recursively for each layer, the number of multiplications is reduced for all layers. Again, assume
that all layers have N neurons and that K of them are kept, i.e. N-K out of N components of $e_p^{(l)}$ are set to zero. In the
backward path, • now requires KN multiplies. KP still needs N multiplies and OP needs 2KN multiplies. The
total number is reduced from $3N^2 + N$ to $3KN + N$, about a factor of N/K. Undoubtedly, this reduction changes the
normal backpropagation algorithm. A convergence proof is presented next.
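
The multiply counts above can be tabulated with a short script. This is only a sketch of the counting argument under the text's assumption of N neurons in every layer; the function names and the `last_layer` flag are illustrative and do not come from the paper.

```python
def forward_multiplies(n):
    """Multiplies for one matrix-vector product between two layers of n neurons."""
    return n * n


def backward_multiplies(n, k=None, last_layer=False):
    """Multiplies in the backward path between two adjacent layers.

    n: neurons per layer (all layers assumed equal, as in the text)
    k: number of neuron errors kept (None means all n are kept)
    last_layer: the output layer needs no matrix-vector product
    """
    if k is None:
        k = n
    matvec = 0 if last_layer else k * n  # the "•" operation on the kept errors
    kp = n                               # term-by-term product with f'
    op = 2 * k * n                       # outer product plus learning-rate scaling
    return matvec + kp + op


if __name__ == "__main__":
    n = 64
    full = backward_multiplies(n)
    half = backward_multiplies(n, k=n // 2)
    # The ratio approaches N/K = 2 as n grows.
    print(full, half, full / half)
```

With K = N/2 the backward count drops from $3N^2 + N$ to $1.5N^2 + N$, so the ratio is slightly below 2 for finite N.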


General  Robbins-Monro  Process 

The  general  Robbins-Monro  process  with  exogenous  noise  [3]  will  be  used  to  prove  the  reduced  operation 
backpropagation.  For  the  proof,  the  process  is  given  by 

$$w_{n+1} = w_n + a_n h(w_n, z_n) \tag{1}$$

where $w_n$ is the state of the process, i.e. the $n$-th estimate of the optimal value of $w$. The sequence $\{w_n\}$ is assumed to
be bounded w.p.1. $z_n$ is a random variable which is independent of the state $w_n$ and its past values. The sequence
$\{z_n\}$ is called exogenous noise.

There are three conditions (C1.1, C1.2, and C1.3) to be satisfied for convergence w.p.1 of the sequence
$\{w_n\}$. They are

C1.1 $h$ is a bounded measurable $\mathbb{R}^r$-valued function. It is continuous in $w$, uniformly in $z$ on bounded $w$ sets.

C1.2 For each $\varepsilon > 0$ and each $w$

$$\lim_{n\to\infty} P\left\{\sup_{m \ge n} \left|\sum_{i=n}^{m} a_i \left(h(w, z_i) - \bar{h}(w)\right)\right| \ge \varepsilon\right\} = 0$$

C1.3 $\{a_i\}$ is a sequence of positive real numbers such that $a_i \to 0$ and $\sum_{i=1}^{n} a_i = \infty$ as $n \to \infty$.

With the three required conditions satisfied, the sequence $\{w_n\}$ (if bounded w.p.1) can be interpolated into a
continuous parameter process that has the same asymptotic properties as the solution of the ordinary
differential equation

$$\frac{dw_t}{dt} = \bar{h}(w_t) \tag{2}$$


Application  To  Reduced  Operation  Algorithm 

The general Robbins-Monro process can now be applied to prove the convergence of the reduced operation
proposal. Equation (1) is the equation that updates all the weights in the weight space if we define $w_n$ to be the
state of the neural network at the $n$-th update. The formula used to update the weights at state $w$ is

$$h(w, z_i) = \left(h(w_{11}^{(1)}, z_i), \ldots, h(w_{jk}^{(l)}, z_i), \ldots, h(w_{n_L n_{L-1}}^{(L)}, z_i)\right)^T \tag{3}$$

where each component is used to update one weight and $n_l$ is the number of neurons in layer $l$. We define the random variable $z_i$ as

$$z_i = \left(s_{i1}^{(1)}, \ldots, s_{ij}^{(l)}, \ldots, s_{i n_L}^{(L)}, p_i\right)^T \tag{4}$$

where $s_{ij}^{(l)}$ is a selection variable. For the $i$-th update, $s_{ij}^{(l)}$ is 1 if the corresponding error $e_{pj}^{(l)}$ is selected, and 0
otherwise. The selection is random and pattern independent. Each error in the same layer has an equal chance of
being selected. Another random variable is $p_i$, which is the index of the pattern p in the $i$-th update. We choose at
random (uniformly) an integer $p_i \in \{1, 2, \ldots, P\}$ where P is the total number of patterns in the training set. From
now on, we will use $p_i$ in place of p to emphasize the randomness of pattern selection.

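Now, we can write the formula to update a weight in terms of a modified error term. That is

$$h(w_{jk}^{(l)}, z_i) = v_{p_i j}^{(l)}\, y_{p_i k}^{(l-1)} \tag{5}$$

where the modified error term $v_{p_i j}^{(l)}$ is derived in the same recursive manner as $e_{p_i j}^{(l)}$ of the normal backpropagation
algorithm, except that it is always multiplied by $s_{ij}^{(l)}$. For the output layer $L$

$$v_{p_i j}^{(L)} = e_{p_i j}^{(L)}\, s_{ij}^{(L)} = \left(d_{p_i j} - y_{p_i j}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i j}^{(L)}\right) s_{ij}^{(L)} \tag{6}$$

and for a hidden layer $l$

$$v_{p_i j}^{(l)} = \left(\sum_{m=1}^{n_{l+1}} v_{p_i m}^{(l+1)}\, w_{mj}^{(l+1)}\right) f'\!\left(\mathrm{sum}_{p_i j}^{(l)}\right) s_{ij}^{(l)} \tag{7}$$

where $n_l$ is the number of neurons in layer $l$. We will write $h(w_{jk}^{(l)}, z_i)$ for a few layers explicitly in terms of all the $s_{ij}^{(l)}$ it depends on so that later proofs will
be easier to understand. For the output layer $L$

$$h(w_{jk}^{(L)}, z_i) = \left(d_{p_i j} - y_{p_i j}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i j}^{(L)}\right) s_{ij}^{(L)}\, y_{p_i k}^{(L-1)} \tag{8}$$

for layer $L-1$

$$h(w_{jk}^{(L-1)}, z_i) = \left[\sum_{m=1}^{n_L} \left(d_{p_i m} - y_{p_i m}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i m}^{(L)}\right) s_{im}^{(L)}\, w_{mj}^{(L)}\right] f'\!\left(\mathrm{sum}_{p_i j}^{(L-1)}\right) s_{ij}^{(L-1)}\, y_{p_i k}^{(L-2)} \tag{9}$$

and for layer $L-2$

$$h(w_{jk}^{(L-2)}, z_i) = \left[\sum_{r=1}^{n_{L-1}} \left(\sum_{m=1}^{n_L} \left(d_{p_i m} - y_{p_i m}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i m}^{(L)}\right) s_{im}^{(L)}\, w_{mr}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i r}^{(L-1)}\right) s_{ir}^{(L-1)}\, w_{rj}^{(L-1)}\right] f'\!\left(\mathrm{sum}_{p_i j}^{(L-2)}\right) s_{ij}^{(L-2)}\, y_{p_i k}^{(L-3)} \tag{10}$$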

The next step is to show that C1.1, C1.2, and C1.3 are satisfied. C1.1 is satisfied if we assume that
each weight is bounded. C1.3 can be satisfied by choosing the learning rate appropriately, such as $a_n = c/n$ where c is
a positive real number. The most difficult one is C1.2. First, we define

$$\bar{h}(w) = E[h(w, z_i)] \tag{11}$$

where E is expectation. We also define

$$Y_i = h(w, z_i) - \bar{h}(w) \tag{12}$$

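So we have to show that

$$\lim_{n\to\infty} P\left\{\sup_{m \ge n} \left|\sum_{i=n}^{m} a_i Y_i\right| \ge \varepsilon\right\} = 0 \tag{13}$$

[3] shows, by using the martingale inequality of Doob, that (13) holds if $E\left|\sum_{i=1}^{n} a_i Y_i\right|^2 < \infty$ as $n \to \infty$, and if $\sum_{i=1}^{n} a_i Y_i$
is a martingale sequence. Also, assume that $\sum_{i=1}^{n} a_i^2 < \infty$ as $n \to \infty$.

$E\left|\sum_{i=1}^{n} a_i Y_i\right|^2 < \infty$ as $n \to \infty$ since $h(w, z_i)$ is bounded as in C1.1. We need to prove that $\sum_{i=1}^{n} a_i Y_i$ is a
martingale sequence, which is to show that

$$E\left[\sum_{i=1}^{n} a_i Y_i \,\middle|\, \sum_{i=1}^{n-1} a_i Y_i\right] = \sum_{i=1}^{n-1} a_i Y_i \tag{14}$$

From (11) and (12), we have

$$E[Y_i] = E[h(w, z_i) - \bar{h}(w)] = E[h(w, z_i)] - \bar{h}(w) = 0 \tag{15}$$

Using (15) and the fact that the $Y_i$ are i.i.d. random variables with respect to $i$ (from the definition of $z_i$), the left-hand side of (14) can be written as

$$E\left[\sum_{i=1}^{n} a_i Y_i \,\middle|\, \sum_{i=1}^{n-1} a_i Y_i\right] = E\left[\sum_{i=1}^{n-1} a_i Y_i \,\middle|\, \sum_{i=1}^{n-1} a_i Y_i\right] + E\left[a_n Y_n \,\middle|\, \sum_{i=1}^{n-1} a_i Y_i\right] = \sum_{i=1}^{n-1} a_i Y_i + a_n E[Y_n] = \sum_{i=1}^{n-1} a_i Y_i$$

This shows that (14) holds and completes the proof that $\sum_{i=1}^{n} a_i Y_i$ is a martingale sequence. If we choose $a_i$ such that
$\sum_{i=1}^{n} a_i^2 < \infty$ as $n \to \infty$, C1.2 is satisfied. Note that $a_n = c/n$ as in C1.3 works here.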
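
A toy run of process (1) with the step size $a_n = c/n$ can make the conditions concrete. The sketch below is not the network case: it assumes a scalar quadratic error $\frac{1}{2}(w - w^*)^2$, a single selection variable that is 1 with probability q, and Gaussian noise standing in for random pattern selection. All names and parameter values are illustrative.

```python
import random


def robbins_monro(w0, target, q, c=2.0, steps=20000, seed=0):
    """Iterate w_{n+1} = w_n + a_n * h(w_n, z_n) with a_n = c / n.

    h is the noisy negative gradient of 0.5 * (w - target)**2, multiplied by a
    0/1 selection variable that equals 1 with probability q, so the mean field
    is -q * (w - target) and its only zero is w = target.
    """
    rng = random.Random(seed)
    w = w0
    for n in range(1, steps + 1):
        s = 1.0 if rng.random() < q else 0.0  # selection variable
        noise = rng.gauss(0.0, 1.0)           # stands in for pattern sampling
        h = s * (-(w - target) + noise)       # h(w_n, z_n)
        w += (c / n) * h                      # a_n = c / n
    return w


if __name__ == "__main__":
    print(robbins_monro(w0=5.0, target=1.0, q=0.5))
```

Setting the error to zero half of the time (q = 0.5) only rescales the mean field, so the iterate still settles near the same point, only more slowly than with q = 1.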

The last step is to solve the ODE (2), but first we have to derive $\bar{h}(w)$ or $E[h(w, z_i)]$, where each component
$E[h(w_{jk}^{(l)}, z_i)]$ can be derived as follows. For the output layer $L$, from (8)

$$E[h(w_{jk}^{(L)}, z_i)] = E\left[\left(d_{p_i j} - y_{p_i j}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i j}^{(L)}\right) y_{p_i k}^{(L-1)}\right] E\left[s_{ij}^{(L)}\right] \tag{16}$$

The term inside the first expectation on the right hand side of (16) is the negative error gradient of the weight in the
last layer with pattern p. Since we select a pattern randomly with a uniform distribution, the expectation becomes
$-(1/P)\,\partial E_T/\partial w_{jk}^{(L)}$ where $E_T$ is the total error. The expectation of $s_{ij}^{(L)}$ is by its definition the probability that the error $e_{pj}^{(L)}$
is selected. The probability must be the same for all neurons within the same layer. We define the probability as
$q^{(L)}$. Consequently, (16) becomes

$$E[h(w_{jk}^{(L)}, z_i)] = -\frac{q^{(L)}}{P} \frac{\partial E_T}{\partial w_{jk}^{(L)}} \tag{17}$$

Applying the same procedure as for the output layer $L$ to the hidden layer $L-1$ of (9) yields

$$E[h(w_{jk}^{(L-1)}, z_i)] = E\left[\left(\sum_{m=1}^{n_L} e_{p_i m}^{(L)}\, w_{mj}^{(L)}\right) f'\!\left(\mathrm{sum}_{p_i j}^{(L-1)}\right) y_{p_i k}^{(L-2)}\right] E\left[s_{im}^{(L)}\right] E\left[s_{ij}^{(L-1)}\right] = E\left[e_{p_i j}^{(L-1)}\, y_{p_i k}^{(L-2)}\right] E\left[s_{im}^{(L)}\right] E\left[s_{ij}^{(L-1)}\right] \tag{18}$$

In deriving (18), for simplicity, we assume, w.l.o.g., that $s_{ij}^{(l)}$ is layer independent, i.e. errors are set to zero
independently in each layer. Again the term inside the expectation on the right hand side of (18) is the negative
error gradient of the weight in the hidden layer $L-1$ with pattern p. Hence,

$$E[h(w_{jk}^{(L-1)}, z_i)] = -\frac{q^{(L)} q^{(L-1)}}{P} \frac{\partial E_T}{\partial w_{jk}^{(L-1)}} \tag{19}$$

$E[h(w_{jk}^{(L-2)}, z_i)]$ can be derived in the same way. Once we take the expectation of all the $s_{ij}^{(l)}$, the rest of the expectation is
just the negative error gradient of the weight in the hidden layer $L-2$ with pattern p. From (10), the result is

$$E[h(w_{jk}^{(L-2)}, z_i)] = -\frac{q^{(L)} q^{(L-1)} q^{(L-2)}}{P} \frac{\partial E_T}{\partial w_{jk}^{(L-2)}} \tag{20}$$


In fact, from the recursive nature of (7), for any layer $l$, we can conclude that

$$E[h(w_{jk}^{(l)}, z_i)] = -\frac{1}{P} \frac{\partial E_T}{\partial w_{jk}^{(l)}}\, q^{(L)} q^{(L-1)} \cdots q^{(l)} \tag{21}$$

Notice that the output layer $L$ has the highest expected value (in absolute value), with a factor of only $q^{(L)}$,
whereas the lower the hidden layer (smaller $l$), the lower the expected value because of more factors of $q^{(l)}$. This is
as expected since some neuron errors are set to 0 in each layer. They are propagated back and become less accurate.

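Now we can solve (2). $w_t$ is a column vector with each component a weight. Let's index it from 1 to D
where D is the total number of weights in the network, i.e. $w_t = (w_1, \ldots, w_k, \ldots, w_D)^T$. We have

$$\frac{dw_t}{dt} = \left(\frac{dw_1}{dt}, \ldots, \frac{dw_k}{dt}, \ldots, \frac{dw_D}{dt}\right)^T \tag{22}$$

From (21), define $c^{(l)}$ as $(1/P)\, q^{(L)} q^{(L-1)} \cdots q^{(l)}$. Furthermore, define $c_k = c^{(l)}$ for all corresponding weights $w_k$ of layer $l$.
For example, $c_1 = c_2 = c^{(1)}$ if $w_1$ and $w_2$ are both in layer 1. From (11) and (21), we can now write

$$\bar{h}(w_t) = \left(-c_1 \frac{\partial E_T}{\partial w_1}, -c_2 \frac{\partial E_T}{\partial w_2}, \ldots, -c_k \frac{\partial E_T}{\partial w_k}, \ldots, -c_D \frac{\partial E_T}{\partial w_D}\right)^T \tag{23}$$

From (2), (22) equals (23), so we have

$$\frac{dw_k}{dt} = -c_k \frac{\partial E_T}{\partial w_k} \tag{24}$$

Let's examine $dE_T/dt$, the change of total error with time. Using the chain rule, we write

$$\frac{dE_T}{dt} = \sum_{k=1}^{D} \frac{\partial E_T}{\partial w_k} \frac{dw_k}{dt} \tag{25}$$

Substituting (24) into (25) yields

$$\frac{dE_T}{dt} = -\sum_{k=1}^{D} c_k \left[\frac{\partial E_T}{\partial w_k}\right]^2 \tag{26}$$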


The differential equation (2) is known as an autonomous functional-differential equation. [4] covers it in
great detail. For our case, a simplified explanation is as follows.

Since $c_k$ is greater than zero by its definition, $dE_T/dt$ in (26) is never positive. Hence, $E_T$ does not increase in t and,
because $E_T$ is always greater than or equal to zero, it has a limit. Moreover, $E_T$ is differentiable, so we can conclude
that $dE_T/dt \to 0$. From (23) and (26), $dE_T/dt = 0$ if and only if $\bar{h}(w_t) = 0$. That means $dw_t/dt = 0$, or $w_t$ is at a fixed
point. Consequently, we can conclude that $w_t$ reaches a local minimum w if there exists one, the same condition
as for a normal backpropagation algorithm. This completes the convergence proof of the reduced operation
backpropagation algorithm.
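
The descent argument can be checked numerically on a small example. The sketch below integrates $dw_k/dt = -c_k\,\partial E_T/\partial w_k$ with forward Euler steps for an assumed quadratic error $E_T = \frac{1}{2}\sum_k a_k w_k^2$. The quadratic, the step size, and the values of $a_k$ and $c_k$ are all illustrative choices, not taken from the paper.

```python
def total_error(w, a):
    """Assumed quadratic error E_T = 0.5 * sum_k a_k * w_k**2."""
    return 0.5 * sum(ak * wk * wk for ak, wk in zip(a, w))


def gradient(w, a):
    """Partial derivatives dE_T/dw_k of the quadratic above."""
    return [ak * wk for ak, wk in zip(a, w)]


def integrate_flow(w, a, c, dt=0.01, steps=2000):
    """Forward-Euler integration of dw_k/dt = -c_k * dE_T/dw_k.

    Returns the final weights and the error recorded after every step.
    """
    errors = [total_error(w, a)]
    for _ in range(steps):
        g = gradient(w, a)
        w = [wk - dt * ck * gk for wk, ck, gk in zip(w, c, g)]
        errors.append(total_error(w, a))
    return w, errors


if __name__ == "__main__":
    # Different positive c_k per weight, as for weights in different layers.
    w_final, errs = integrate_flow([1.0, -2.0], a=[1.0, 4.0], c=[0.5, 0.125])
    print(errs[0], errs[-1])
```

Because every $c_k$ is positive, each weight moves against its own partial derivative, and the recorded error never increases even though the $c_k$ differ between weights.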


Largest  K  Algorithm 

In the last section, setting some of the errors to zero randomly has been shown to converge, but the rate of
convergence is not known. In this section, we propose that the largest K errors in absolute value in each layer are
kept and the rest are set to zero. K can vary from layer to layer. This is intended to reduce the number of iterations.
The reasons can be briefly explained as follows.

The vector used in a stochastic update of the backpropagation algorithm is the instantaneous negative
gradient of a particular pattern. Let's call it $g_p$. The sum over all the patterns is $g$, which points in the direction of the
negative gradient of the total error in the weight space. For a batch update, $g$ is used as the vector. In the reduced
operation method, the vector in the update is $h(w, z_i)$. Let's call it $h_p$. If we use a batch update, it is $h$ instead and
the convergence proof is still valid, but there will be no 1/P factor in (21) since pattern p is not random. We update
the weights after all patterns have been presented. Batch update will be discussed first since it is easier to understand.
At a particular point on the error surface, $g$ and $h$ can be calculated. If we want a descent direction to guarantee that
$E_T$ can be reduced after the update from that point, the directional derivative of $E_T$ must be negative, i.e. the dot
product of the gradient, which is $-g$, and $h$ is negative or, equivalently, the inner product $g^T h > 0$. This has a
maximum when $h = g$. That is what happens in normal backpropagation, even though it does not guarantee reaching
the local minimum faster since normally $g$ does not point to the local minimum. Nevertheless, we want the angle
between $g$ and $h$ to be less than 90° so that, at least, we will be able to reduce $E_T$ with an appropriate learning rate.
In the reduced operation method with errors selected randomly, $g^T h$ is sometimes less than zero, but on
the average over a long period of time, descent directions are achieved. The same is true for a stochastic update of
normal backpropagation where we use $g_p$ instead of $h$. To understand why the errors should be chosen according to
their magnitudes instead of chosen randomly, consider a case of a batch update with only one pattern (P=1) and one
layer (output layer). $g^T h$ is bounded to be non-negative since $h$ is just $g$ with some components set to zero. If we
select K errors, $g^T h$ is maximized when the largest K errors in magnitude are chosen. With L>1 and P>1, it is less
obvious to see, but choosing the errors according to their magnitudes will maximize the chance of moving in the
descent directions. For a stochastic update, the idea is similar. We try to make $h_p$ as close to $g_p$ as possible, i.e. to
maximize $g_p^T h_p$, so that $E_T$ is reduced faster. This can be achieved by selecting the largest K errors.
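
The single-layer, single-pattern case can be illustrated directly. For one output layer the update for weight $w_{jk}$ is $e_j y_k$, so $g^T h$ equals the sum of the squares of the kept errors times a factor common to both selection rules, and comparing the sums of squared kept errors is enough. The sketch below is a hypothetical illustration with made-up helper names, not the paper's implementation.

```python
import random


def mask_largest_k(errors, k):
    """Keep the k errors of largest magnitude and set the rest to zero."""
    order = sorted(range(len(errors)), key=lambda j: abs(errors[j]), reverse=True)
    keep = set(order[:k])
    return [e if j in keep else 0.0 for j, e in enumerate(errors)]


def mask_random_k(errors, k, rng):
    """Keep k errors chosen uniformly at random and set the rest to zero."""
    keep = set(rng.sample(range(len(errors)), k))
    return [e if j in keep else 0.0 for j, e in enumerate(errors)]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    e = [rng.gauss(0.0, 1.0) for _ in range(16)]
    top = dot(e, mask_largest_k(e, 8))      # proportional to g^T h, largest-K rule
    rnd = dot(e, mask_random_k(e, 8, rng))  # proportional to g^T h, random rule
    print(top >= rnd >= 0.0)
```

No subset of K squared errors can sum to more than the K largest ones, which is why the largest-K rule never does worse than a random choice in this restricted case.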

Simulation  Results  Discussion 

Due to limited space, two simulation results will be summarized. In both problems, a small constant
learning rate is used in place of a decreasing learning rate as the theory suggests, because the latter is very slow to
converge in practice. The network has two hidden layers (L=3) and is fully connected. The first simulation is a
classification problem where we train the network to recognize English characters. The objective is to minimize
total error $E_T$ only, i.e. we look at the problem as a non-linear optimization problem. In most runs (different initial
weights), selecting the largest K errors reduces $E_T$ faster than selecting the errors randomly. The smaller K becomes
(selecting fewer errors), the bigger the difference in convergence speed between the two selection methods. For
selecting K errors randomly, the larger K is, the faster the network converges in most cases. Normal
backpropagation is the fastest (requires the minimum number of iterations). However, for selecting the largest K errors,
normal backpropagation is not necessarily the fastest. In many runs, K=N/2 is the fastest. For stochastic
update, with 50 different initial weights and the stopping criterion that $E_T$ reaches the point when the average error for
each output is 10%, we have that, on the average, selecting the largest K=N/2 requires 75% of the total
number of iterations of normal backpropagation. Obviously, K=N/2 requires fewer • and
OP operations in the backward path for each iteration. So, it actually takes less than 75% of the time in a real chip
(for this particular problem). This is possible since in normal backpropagation moving in the negative gradient
direction with a finite step length is not guaranteed to reduce total error the most.

The second simulation is a continuous function approximation problem. Testing patterns are used to
monitor the network generalization performance, instead of training patterns only as in the first one. Noise is also
injected to increase the network's ability to generalize. The performance is measured by the number of iterations it
takes to reach overtraining, i.e. when testing errors go up, and by the total error at that time. Selecting the largest K
errors (K=N/2) performs very well against the normal algorithm in terms of generalization, with and without noise
injection. On the average, the plots are very similar. Again, selecting the largest K errors requires fewer •
and OP operations for each iteration.

Hardware  Implementation  and  Comparison 

We proposed a detailed, efficient hardware implementation in [5]. The proposed architecture resembles the data
flow diagram in figure 1. Basically, each operation becomes a unit, all working simultaneously. The pipelined chip
has parallel MACs (Multiply-Accumulators) in the MAC unit to handle matrix-vector multiplications (• operations).
A largest K unit based on the Manchester carry chain was also proposed. It comprises many cells. Each cell executes
two phases of operations during each cycle: compare and shift. The number stored in each cell is compared to the
input, which enters the unit serially each cycle. Shifting and storing occur according to the comparison result.
The largest K unit with N cells, in Y cycles, can output the largest X out of Y numbers as long as Y-X <= N. By
adding the largest K unit to the chip, the number of • operations can be reduced. Each cell takes about one-tenth of
the area of a multiply-accumulator since we only need a carry out to compare two numbers. We performed SPICE
simulations to show its feasibility for use in a high speed neural network processor chip such as the Stanford
Boltzmann machine [6], a 125 MHz deeply pipelined digital CMOS processor.

The actual number of cycles to train a network depends on the architecture of the hardware, i.e. how we
allocate resources for the chip. For comparison purposes, the assumption is that the chip is very large, to handle a
real world problem, so that memory and processor area dominate the total area. If the number of parallel MACs (M)
in the MAC unit is large, the assumption is valid since many operations in figure 1 do not scale up with M. Only the
memory unit (to store weights and their changes) and the following units, which are counted as processor, scale up
with M: multipliers for the OP operations in the OP unit, multipliers and adders for the weight update unit, and the
largest K unit. The actual area ratio of memory to processor depends on many factors such as the number of
patterns/second that we want the chip to process. The Stanford Boltzmann machine has about equal area for memory
and processor. A higher ratio means that the chip can handle a bigger problem, but it takes a longer time to finish.

For the normal backpropagation algorithm, there are M MACs in the MAC unit. We choose to have M
multipliers for the OP unit, and M multipliers and M adders for the weight update unit to multiply each weight change by the
learning rate and update the weight. The number is reasonable since the OP unit and the weight update unit must keep up
with the MAC unit, particularly the weight update unit: it must be fast enough to update all the weights in a stochastic
update. Otherwise, the MAC unit can stall. Assume that each pattern presentation takes 1 unit of time for the
forward path to complete all the • operations and produce outputs. The backward path will also require about 1 unit
of time to complete all the • operations. Assume that weight update takes an additional 1 unit of time after the
backward path finishes to access the weight memory and update all the weights before a new pattern can be presented.
The total time for normal backpropagation is 3 units of time for each pattern presentation in the learning mode. The
X1 architecture by Adaptive Solutions [7], one of the most powerful neural network processors, requires 6 units of
time for learning relative to feedforward computations. Since our proposed hardware is for the backpropagation
algorithm only, 3 units of time should be reasonable to assume.

For the reduced operation algorithm in the case of selecting half the errors, we can trade off between speed
and area. Let's consider the case where there are M MACs. In this case, the OP unit has M/2 multipliers, and the
weight update unit has M/2 multipliers and M/2 adders. The largest K unit needs M/2 cells to select the largest
M/2 out of M errors. In terms of the unit of time, it is clear that this scheme takes 1 unit of time for the forward
path, but only takes 0.5 units of time for the backward path. With M/2 multipliers for the OP unit, it requires the
same amount of time as the normal case with M multipliers since only half the weight matrix is expanded in the
OP operations. The same reasoning applies to calculating new weight values, i.e. M/2 multipliers and M/2 adders in
the weight update unit are sufficient. However, it will take an additional 0.5 instead of 1 unit of time to update the
weights since only half the weight memory is accessed. Consequently, the total time to learn one pattern is 2 units
of time. That means it takes 2/3 of the time of the normal algorithm for each iteration. The total number of iterations
is shown to be less than or about the same as that of the normal algorithm. Further speed up could be obtained in many ways
without increasing the hardware complexity and area, such as varying the learning rate dynamically.

We can now compare the area of both methods. The area is estimated by the number of transistors since
all units counted as processor can be considered logic and have the same area density per transistor. The weight,
activation, and learning rate values are assumed to be 8 bits. Also, assume that the accumulator can accumulate
an additional 8 bits beyond the MSB of a 17-bit product between a weight and an activation value with no overflow.
The number of transistors for each resource is estimated in detail in [5]. An 8x8-bit multiply-accumulator has 2240
transistors. An 8x8-bit multiplier has 1910 transistors. An 8-bit adder has 290 transistors. One cell of an 8-bit
largest K unit has 200 transistors (including the control). For the normal algorithm, the total number of transistors is
thus (2240 + 1910 + 1910 + 290)M = 6350M. For the reduced operation algorithm, the total number of transistors
is 2240M + (1910 + 1910 + 290 + 200)M/2 = 4395M.
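
The time and transistor totals can be reproduced with a few lines of arithmetic, using the per-unit counts quoted from [5] (2240, 1910, 290 and 200 transistors). The last function estimates the saving in total chip area for a given memory-to-processor area ratio; it assumes that the memory area is expressed relative to the processor area of the normal algorithm and is identical for both methods. That reading is an assumption of this sketch, as are all function names.

```python
MAC, MULT, ADDER, CELL = 2240, 1910, 290, 200  # transistor counts quoted from [5]

TIME_NORMAL = 1 + 1 + 1       # forward, backward, weight update (units of time)
TIME_REDUCED = 1 + 0.5 + 0.5  # half the errors kept


def processor_transistors_normal(m):
    """M MACs, M OP multipliers, M update multipliers and M update adders."""
    return (MAC + MULT + MULT + ADDER) * m


def processor_transistors_reduced(m):
    """M MACs, plus M/2 each of OP multipliers, update multipliers, adders, cells."""
    return MAC * m + (MULT + MULT + ADDER + CELL) * m / 2


def total_area_saving(memory_to_processor_ratio, m=1):
    """Fractional saving in total area (assumption: memory area is the given
    ratio times the normal processor area, and is the same for both methods)."""
    normal = processor_transistors_normal(m)
    reduced = processor_transistors_reduced(m)
    memory = memory_to_processor_ratio * normal
    return (normal - reduced) / (normal + memory)


if __name__ == "__main__":
    print(processor_transistors_normal(1), processor_transistors_reduced(1))
    print(TIME_REDUCED / TIME_NORMAL)
    for ratio in (4, 1, 0.5):
        print(ratio, round(100 * total_area_saving(ratio), 1))
```

Under this assumption the processor alone shrinks by about 31%, and the overall saving falls as memory takes a larger share of the chip.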

Now, the area saving can be computed. Since the memory area assumption is the same for both methods,
the area saving comes from the processor part only. The saving ranges from 6% to 20% when the memory to
processor area ratio decreases from 4 to 0.5. In fact, we can also trade off speed and area by varying M. In conclusion,
the reduced operation algorithm is indeed better than normal backpropagation by speed and area comparison. The
saving in area can be allocated to the weight memory or other resources, or it can be used to reduce the size of
the chip to increase yield and reduce power consumption.

Abstract

This  paper  investigates  the  incorporation  of  fault  tolerance  at  the  learning  stage  into  Radial 
Basis  Function  (RBF)  networks.  The  approach  is  particularly  attractive  since  the  cost  of 
fault  detection  and  correction  in  a  practical  VLSI  implementation  of  such  networks  could 
be  prohibitive  due  to  the  large  number  of  neurons  and  connections.  The  RBF  networks 
considered  are  applied  to  the  task  of  analog  function  approximation.  A  fairly  general  fault 
model  is  considered  wherein  faulty  neurons  are  assumed  to  be  stuck  at  a  random  value.  Two 
new  learning  methods  based  on  regression  are  proposed  to  learn  the  weights  and  one  new 
regression  based  learning  method  is  proposed  to  learn  the  centers.  The  methods  explicitly 
take into account the mean squared error in the objective function in the presence of faults
and use stepwise selection methods to choose the regressors. Simulation results are presented
which show that a considerable improvement in fault tolerance can be achieved over the
non-fault-tolerant learning algorithm.


1  Introduction 

We  consider  the  fault-tolerance  behavior  of  the  class  of  feed-forward  networks  known  as  Radial 
Basis  Function  (RBF)  networks.  The  RBF  networks  studied  are  utilized  for  the  purpose  of  analog 
function approximation. Let $S = \{(x(j), y(j)) \in R^n \times R \mid j = 1, 2, 3, \ldots, N\}$ be a set of data
points which is a subset of the graph of a function, $f(x)$. By using the set $S$ to learn, it is desired
to find an RBF network such that when given input $x(j)$, it produces an output which is, in some
sense, close to $y(j)$.

The  use  of  RBF  networks  for  solving  analog  function  approximation  problems  was  analyzed 
by  [9]  and  also  by  [5].  RBF  networks  have  been  applied  to  this  task  in  [1],  [4],  [7],  [6]  and  [8].  In 
[8]  it  is  shown  that  RBF  networks  are  a  special  case  of  regularization  networks.  Fault  tolerant 
training  of  feedforward  networks  with  backpropagation  training  algorithm  is  considered  in  [10], 
[11]  and  [12]. 

We study the fault tolerance of an RBF network with respect to the failures of the hidden
units. The output of a faulty unit $i$ is assumed to be stuck at a random value $Z_i$ which is uniformly
distributed over an interval defined by the minimum and maximum values the unit can assume
(in our case, 0 and 1, respectively). The measure of fault tolerance used is the mean squared error
of the calculated outputs of the RBF network. We assume that the hidden units fail independently with
some probability $p$. Let $W_i$ be a random variable indicating whether unit $i$ has failed or not. $W_i$
takes the value 1 if the unit has failed and 0 otherwise. Thus, $W_i$ is 1 with probability $p$ and 0
with probability $1 - p$. Furthermore, the $Z_i$'s and $W_i$'s are assumed to be independent of each
other.
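The fault model just described can be sketched in a few lines. This is only an illustration: the helper names `sample_faults` and `apply_faults` are hypothetical, and the stuck-at interval [0, 1] follows the case stated in the text.

```python
import random

def sample_faults(num_units, p, rng):
    """Draw the failure indicators W_i and stuck-at values Z_i for all units.

    Each unit fails independently with probability p; Z_i is uniform on [0, 1].
    """
    w = [1 if rng.random() < p else 0 for _ in range(num_units)]
    z = [rng.uniform(0.0, 1.0) for _ in range(num_units)]
    return w, z

def apply_faults(hidden_outputs, w, z):
    """Replace the output of every failed unit (W_i = 1) by its stuck value Z_i."""
    return [zi if wi else r for r, wi, zi in zip(hidden_outputs, w, z)]

rng = random.Random(0)
w, z = sample_faults(num_units=4, p=0.25, rng=rng)
faulty_outputs = apply_faults([0.9, 0.1, 0.5, 0.3], w, z)
```

The same pair (W, Z) would be reused for every data point in one evaluation run, since a stuck unit stays stuck at the same value.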


2 Fault Tolerance Learning: (FTL1)

In the following we define our notation. Let

$\{x(j) \in R^n,\ j = 1, 2, 3, \ldots, N\}$ be the set of input points,

$\{y(j) \in R,\ j = 1, 2, 3, \ldots, N\}$ be the corresponding set of output points,

$c_i \in R^n,\ i = 1, 2, 3, \ldots, M$ be the centers for the radial basis functions,

$\sigma_i,\ i = 1, 2, 3, \ldots, M$ be the variance corresponding to the $i$-th radial basis function,

$r_{ij}$ be the output of the $i$-th hidden unit for the $j$-th data point, i.e., $r_{ij} = \exp(\cdot)$, a Gaussian
function of the distance between $x(j)$ and $c_i$ scaled by $\sigma_i$. We define $r_{0j} = 1$, $j = 1, 2, 3, \ldots, N$.

Let $\theta_i$, $i = 0, 1, 2, \ldots, M$, be the connection weight from the $i$-th hidden unit to the output unit.

$\epsilon(j)$, $j = 1, 2, 3, \ldots, N$, denotes the error between $y(j)$ and the actual output produced by the
network. Thus,

$$y(j) = \sum_{i=0}^{M} \theta_i r_{ij} + \epsilon(j) \quad \text{for } j = 1, 2, 3, \ldots, N.$$

To put the above equation into matrix form, we define

$Y = [y(1), y(2), \ldots, y(N)]^T$, the desired output vector,

$R = [R_0, R_1, \ldots, R_M]$, the regression matrix, where $R_i = [r_{i1}, r_{i2}, \ldots, r_{iN}]^T$ represents a column
vector of the outputs of the hidden unit $i$ for the data points $x(1), x(2), \ldots, x(N)$,

$\Theta = [\theta_0, \theta_1, \ldots, \theta_M]^T$, the weight vector,

$E = [\epsilon(1), \epsilon(2), \ldots, \epsilon(N)]^T$, the error vector. Then

$$Y = R\Theta + E. \qquad (1)$$

The error signals $\epsilon(j)$ are treated as random variables which are assumed to be uncorrelated
with the regressors and independent of each other. The least squares method minimizes the
expectation of the squared error $E^T E$ with respect to $\Theta$.
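As a minimal sketch of the fault-free least squares step, the following builds a regressor matrix for scalar inputs and solves for the weights with NumPy. The function names, the toy target $y = x^2$, the number of centers and the exact Gaussian form $\exp(-(x - c_i)^2/\sigma^2)$ are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def regressor_matrix(x, centers, sigma):
    """Build the N x (M+1) matrix R = [R_0, ..., R_M] for scalar inputs.

    Column 0 is the constant unit (r_0j = 1); column i holds the outputs of
    hidden unit i. The Gaussian form used here is an assumption.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    c = np.asarray(centers, dtype=float).reshape(1, -1)
    hidden = np.exp(-((x - c) ** 2) / sigma ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), hidden])

def least_squares_weights(R, y):
    """Return the Theta that minimizes E^T E for Y = R Theta + E."""
    theta, _, _, _ = np.linalg.lstsq(R, np.asarray(y, dtype=float), rcond=None)
    return theta

# Toy data (hypothetical): 21 samples of y = x^2 on [-1, 1], 5 centers.
x = np.linspace(-1.0, 1.0, 21)
y = x ** 2
R = regressor_matrix(x, centers=np.linspace(-1.0, 1.0, 5), sigma=0.5)
theta = least_squares_weights(R, y)
residual = y - R @ theta
```

At the least squares solution the residual is orthogonal to every column of R, which is a convenient sanity check on the fit.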

In the first fault tolerant algorithm, called FTL1, the elements of the regressor matrix $R$ are
modified to take into account the possibility of failures of the hidden units. We denote this
modified regressor matrix by $R_f$. The $(i, j)$-th element of $R_f$ is given by

$$r^f_{ij} = (1 - W_i)\, r_{ij} + W_i Z_i.$$

Note that if the $i$-th unit is not faulty ($W_i = 0$), then $r^f_{ij} = r_{ij}$. On the other hand if the unit
is faulty ($W_i = 1$), then $r^f_{ij} = Z_i$. The regression equation $Y = R_f \Theta_f + E_f$ is then used to
estimate $\Theta_f$ which minimizes the expected mean squared error, $E_f^T E_f$. We get an estimate of
$\Theta_f$ as

$$\hat{\Theta}_f = [R_f^T R_f - S + (1 - p) Q + P]^{-1} R_f^T Y,$$

where

$$S = \mathrm{diag}(R_{f0}^T R_{f0},\ R_{f1}^T R_{f1},\ \ldots,\ R_{fM}^T R_{fM}), \qquad
Q = \mathrm{diag}(R_0^T R_0,\ R_1^T R_1,\ \ldots,\ R_M^T R_M).$$

Figure 1: Comparison of FTL1 and Fault Free Learning: 26 Centers.

We use the following simulation framework for performance evaluation. Simulations for all
the methods are carried out for the task of approximating a sinc function, namely
$y = \mathrm{sinc}(x) = \sin(2\pi x)/(2\pi x)$, where the range of $x$ is $[-1, +1]$. The set of data points is formed by choosing 41
equally spaced points over the range of $x$.

We  run  two  types  of  simulations  for  measuring  the  immunity  of  the  network  to  faults.  With 
the  independent  probability  of  failure,  p,  for  each  hidden  unit,  the  network  is  run  over  all  the 
data  points  and  the  mean  squared  error  is  measured.  This  step  is  repeated  10,000  times  and 
the  average  mean  squared  error  (which  is  a  measure  of  fault  tolerance)  is  evaluated.  To  measure 
robustness,  instead  of  assigning  a  probability  of  failure  to  the  hidden  units,  the  units  are  made 
to  fail  one  at  a  time.  Thus  one  of  the  hidden  units  is  made  faulty  (stuck  at  a  random  value  Z), 
and  the  network  is  run  for  all  the  data  points.  The  mean  squared  error  is  measured.  This  step  is 
repeated  till  each  of  the  hidden  units  is  made  faulty  once.  The  average  mean  squared  error  is  then 
calculated  which  is  a  measure  of  robustness  of  the  system  to  a  failure  of  one  unit.  In  addition, 
since  the  probability  of  failure  is  very  low  in  practice,  we  consider  only  one  failure. 
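The two measurements described above can be sketched as follows. This is an illustrative sketch, not the authors' simulation code: the function names are hypothetical, every column of the matrix passed in is treated as a unit that may fail, and the stuck-at values are drawn uniformly from [0, 1].

```python
import numpy as np

def mse_with_faults(R, theta, y, failed, z):
    """MSE over all data points when the columns flagged in `failed` are stuck at z."""
    Rf = np.where(failed[np.newaxis, :], z[np.newaxis, :], R)
    return float(np.mean((y - Rf @ theta) ** 2))

def fault_tolerance(R, theta, y, p, trials, rng):
    """Average MSE when each unit fails independently with probability p."""
    m = R.shape[1]
    total = 0.0
    for _ in range(trials):
        failed = rng.random(m) < p           # indicators W_i
        z = rng.uniform(0.0, 1.0, size=m)    # stuck-at values Z_i
        total += mse_with_faults(R, theta, y, failed, z)
    return total / trials

def robustness(R, theta, y, rng):
    """Average MSE when the units are made faulty one at a time."""
    m = R.shape[1]
    total = 0.0
    for i in range(m):
        failed = np.zeros(m, dtype=bool)
        failed[i] = True
        z = rng.uniform(0.0, 1.0, size=m)
        total += mse_with_faults(R, theta, y, failed, z)
    return total / m
```

With `rng = np.random.default_rng(0)` and `trials=10000`, `fault_tolerance` mirrors the repeated-trial protocol in the text, while `robustness` averages over exactly one pass through the units.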

The performance of the network is plotted in Figures 1 and 2. In Figure 1 we compare the
performance of fault free learning with that of the FTL1 algorithm. In FTL1 learning the network
was trained for the value of p given on the horizontal axis. The performance of the network is then
evaluated for three cases: 1) when no faults occur (fault-free), 2) when faults occur with probability p
(faults with p), and 3) when one hidden unit has failed (robustness). For fault-free training the performance
is evaluated assuming faults occur with probability p. In Figure 2, the FTL1 algorithm is used
for training assuming a fixed value of p (given in the figure). The performance is then evaluated
for the value of p given on the axis.

From  these  figures  we  can  observe  the  following. 

•  As expected, the network trained with FTL1 results in a lower mean squared error than the
one trained with fault free learning for the probability of failure under consideration.


Figure 2: Comparison of FTL1 for Different Values of p: 26 Centers.

•  The response to different values of probability of failure becomes flatter as the
value of p for FTL1 is increased. Thus, over a range of failure probabilities less than p,
the network exhibits very small deviation in the mean squared error from that in the fault
free case.

•  In all the simulations we have run, FTL1 always outperforms fault-free learning and the
improvement increases as the number of centers increases. This can be attributed to the
following. We have observed that as the number of centers increases, the values of the
weights become very large in the case of fault free learning. Consequently, failure of one
hidden unit causes a large error in the computed output. On the other hand, with FTL1 the
values of the weights remain small even as the number of centers increases. This phenomenon
is similar to the weight control scheme of [12].


3  Fault  Tolerance  Learning:  (FTL2) 

In practice, we often do not have a priori knowledge of the probability of failure p. Also, we are
more interested in studying the performance of the network in the presence of failures. Further, assuming
that failure is a very low probability event, we would be interested in the failure of at most one hidden
unit.

FTL2 is a learning algorithm which concentrates on one hidden unit failure. It does not
depend on the probability of failure p of the hidden units. In FTL2, we consider all the possible
cases in which one hidden unit is faulty and the case in which none of the units is faulty. The
expectation of the sum of the mean squared errors over all these cases is minimized. The estimate
$\hat{\Theta}_f$ minimizes the expectation of the total mean squared error.

Denote by $R_{fl}$ the regressor matrix representing the case in which the $l$-th hidden unit is faulty.
This matrix is obtained by replacing the $l$-th column of regressor matrix $R$ by $Z_l$. Since the
columns of $R_f$ are indexed from 0 through $M$, denote by $R_{f(M+1)}$ the regressor matrix for the
case of no failures. Thus we have,

$$Y = R_{fl}\,\Theta_f + E_l, \quad \text{for } l = 0, 1, 2, \ldots, M, M+1.$$

Figure 3: Comparison of FTL2 and Fault Free Learning.

The estimate of $\Theta_f$ is found by minimizing the following expression:

$$\sum_{l=0}^{M+1} E_l^T E_l = \sum_{l=0}^{M+1} (Y - R_{fl}\,\Theta_f)^T (Y - R_{fl}\,\Theta_f).$$

We get

$$\hat{\Theta}_f = \Big[ \sum_{l=0}^{M+1} R_{fl}^T R_{fl} + P' \Big]^{-1} \sum_{l=0}^{M+1} R_{fl}^T\, Y,$$


where  P'  =  P/p.  The  networks  were  trained  using  fault  free  learning  as  well  as  FTL2.  The  fault 
tolerance  and  robustness  simulations  were  run  on  each  of  the  networks  and  the  results  are  shown 
in  Figure  3.  From  this  figure  we  can  observe  that 

•  In  the  presence  of  one  unit  failure,  FTL2  shows  a  considerable  improvement  in  performance 
over  fault  free  learning. 

•  As the number of centers increases, the robustness of the fault free case can actually worsen,
as there is no control over it, as opposed to FTL2, which guarantees an improvement in
robustness. This is similar to the case of FTL1.
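The FTL2 objective of this section can be sketched numerically. The code below is an assumption-laden illustration rather than a transcription of the closed form in the text: it takes the stuck-at value of a failed unit to be uniform on [0, 1] (so its mean is 1/2 and its variance 1/12), treats every column of R as a unit that may fail, evaluates the expectation analytically, and does not form the $P'$ term explicitly. All function names are hypothetical.

```python
import numpy as np

def ftl2_normal_equations(R, y):
    """Accumulate A = sum_l E[R_l^T R_l] and b = sum_l E[R_l]^T y.

    The sum runs over every single-fault case (column l stuck at Z_l)
    plus the fault-free case.
    """
    R = np.asarray(R, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = R.shape
    A = R.T @ R                   # fault-free case
    b = R.T @ y
    for l in range(m):
        Rl = R.copy()
        Rl[:, l] = 0.5            # E[Z_l] for Z_l uniform on [0, 1]
        Al = Rl.T @ Rl
        Al[l, l] += n / 12.0      # N * Var(Z_l), so the entry equals N * E[Z_l^2]
        A += Al
        b += Rl.T @ y
    return A, b

def ftl2_weights(R, y):
    """Weights minimizing the expected sum of squared errors over all cases."""
    A, b = ftl2_normal_equations(R, y)
    return np.linalg.solve(A, b)

def expected_total_error(R, y, theta):
    """Expected sum over all cases of (Y - R_l theta)^T (Y - R_l theta)."""
    A, b = ftl2_normal_equations(R, y)
    y = np.asarray(y, dtype=float)
    cases = np.asarray(R).shape[1] + 1
    return float(theta @ A @ theta - 2.0 * (b @ theta) + cases * (y @ y))
```

By construction, the weights returned by `ftl2_weights` give an `expected_total_error` no larger than that of the ordinary least squares weights on the same data.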

4  Choice  of  Centers  for  Improving  Robustness  :  LSR  Method 

In this section, we suggest a stepwise method to choose the centers based on robustness
considerations. The method is called LSR since it uses an algorithm similar to Least Squares and is
designed for improving Robustness. We assume that the network is to be trained with FTL2 to
achieve an improvement in robustness.

Figure 4: Comparison of LSR and OLS with FTL2 as Learning Algorithm.

All  the  data  points  are  possible  candidates  for  the  centers  to  be  chosen.  Let  C  denote  the 
set  of  data  points  chosen  to  be  the  centers.  Initially,  the  set  C  is  empty.  In  each  step  of  the 
algorithm one data point is added to the set C. At the $k$-th step we calculate the value of the
objective function for FTL2 using each of the data points not in the set C as a center along with
the $(k - 1)$ data points which are in the set C. The data point which results in the minimum
value  of  the  objective  function  is  added  to  the  set  C.  The  algorithm  is  terminated  when  the  value 
of  the  objective  function  is  within  the  given  acceptable  limit. 
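The stepwise selection loop can be sketched generically. In the sketch below the objective is passed in as a callable so that any criterion (in the paper's setting, the FTL2 objective evaluated for a trial set of centers) can be plugged in; the function name, the `max_centers` safeguard and the toy objective in the usage lines are hypothetical.

```python
def greedy_select_centers(num_points, objective, tolerance, max_centers=None):
    """Forward stepwise selection of centers from the data points.

    objective(indices) must return the value to be minimized when the data
    points with the given indices are used as centers.
    """
    chosen = []
    remaining = list(range(num_points))
    limit = num_points if max_centers is None else max_centers
    value = float("inf")
    while remaining and len(chosen) < limit:
        best_idx, best_val = None, float("inf")
        for idx in remaining:
            val = objective(chosen + [idx])      # try each unused data point
            if best_idx is None or val < best_val:
                best_idx, best_val = idx, val
        chosen.append(best_idx)                  # keep the best candidate
        remaining.remove(best_idx)
        value = best_val
        if value <= tolerance:                   # acceptable limit reached
            break
    return chosen, value

# Toy usage with a stand-in objective (not the FTL2 objective): total distance
# from three target values to their nearest chosen point.
points = [0.0, 0.25, 0.5, 0.75, 1.0]
targets = [0.2, 0.3, 1.0]

def toy_objective(indices):
    return sum(min(abs(t - points[i]) for i in indices) for t in targets)

centers, final_value = greedy_select_centers(len(points), toy_objective, tolerance=0.15)
```

Each step costs one objective evaluation per unused data point, so selecting k centers from N points needs on the order of kN evaluations.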

The  choice  of  the  centers  using  the  method  described  was  compared  with  the  OLS  method  [3] 
for  different  numbers  of  centers.  In  Figure  4  we  plot  the  mean  squared  error  as  a  function  of  the 
number  of  centers.  It  should  be  noted  that  here  we  are  using  the  FTL2  algorithm  for  both  LSR 
and  OLS.  Therefore,  the  difference  in  performance  is  only  due  to  the  choice  of  the  centers. 

From this figure, it can be seen that

•  The network whose centers are chosen using LSR and trained with FTL2 results in better
performance than the one using OLS for selecting the centers, both when there are no faults
and when one fault occurs.


5  Conclusions 

From  the  discussions  and  the  results  presented  in  the  previous  sections  we  can  note  the  following. 

If the probability of failure p for hidden units is specified a priori, then the network learnt with
FTL1 performs better than the network learnt with fault-free learning in terms of fault tolerance.
This means that in the presence of failures the network trained with FTL1 will result in a lower
mean squared error than the network trained with OLS.

FTL2, which does not need the value of p, can be used for achieving an improvement in
robustness. Also, FTL2 is more practical than FTL1 since it does not consider the probabilistic nature
of faults but rather considers the performance of the network given that a fault has occurred.




We can achieve better robustness for a given number of centers when the centers are selected
from a set of data points by the LSR method instead of the OLS method and FTL2 is used for
learning the weights. The reason is that LSR is based upon FTL2 itself, which is a learning
method for improving robustness, as opposed to OLS, which is not based on fault tolerance learning.

If it is required that the mean squared error be within specified limits under fault free and
one fault cases, then using FTL2 and an appropriate number of centers, this criterion can always be
met.

Abstract

This paper presents a new neural network learning theory that is much more brain-like than classical
connectionist learning. Unlike connectionist learning, the algorithms here both design and train networks
and do so in polynomial time. They are also immune to other learning problems (local minima, oscillation,
catastrophic forgetting). This paper shows how an RBF-like net can be generated under this theory for
classification problems.

1. Introduction - A Robust and Efficient Learning Theory

The science of artificial neural networks (ANN) needs a robust theory for generating neural networks.
Lack of a robust learning theory has been a significant impediment to the field. A rigorous theory for
ANN should include learning methods that adhere to the following stringent performance criteria and
tasks:

1. Perform Network Design Task: A neural network learning method must be able to design an
appropriate network for a given problem, since it is a task performed by the brain. A pre-designed net
should not be provided to it as part of its external input, since it never is an external input to the brain.

2. Robustness in Learning: The method must be robust so as not to have the problems of local minima,
oscillation and catastrophic interference and/or similar learning difficulties. The brain does not exhibit
such problems.

3. Quickness in Learning: The method must be quick in its learning and learn rapidly
from only a few examples, much as humans do. For example, an on-line method that learns from only 10
examples is quicker than one that needs 1,000 examples.

4. Efficiency in Learning: The method must be
computationally efficient in its learning when provided with a finite number of training examples. It must
be able to both design and train an appropriate net in polynomial time. That is, given n examples, the
learning time must be a polynomial function of n.

5. Generalization in Learning: The method must be able
to generalize reasonably well so that only a small amount of network resources is used. That is, it must try
to design the smallest possible net. This characteristic must be an explicit part of the algorithm.

This  theory  defines  learning  principles  that  are  obviously  much  more  brain-like  than  those  of  classical 
connectionist  theory.  Judging  by  these  algorithmic  characteristics,  connectionist  learning  is  not  very  powerful 
or  robust.  First  of  all,  it  does  not  even  address  the  issue  of  network  design,  a  task  that  should  be  central  to  any 
learning theory. It is also plagued by efficiency (lack of polynomial time complexity, need for an excessive number
of teaching examples) and robustness problems (local minima, oscillation, and catastrophic interference),
problems that are partly acquired from its attempt to learn without using memory. Classical connectionist
learning, therefore, is not very brain-like at all. Several algorithms have been previously developed for
perceptrons that satisfy these learning principles (Roy et al. [1991, 1993], Mukhopadhyay et al. [1993]). The
algorithm presented here shows RBF-type nets can also be constructed using these same learning principles.

2. Radial Basis Function (RBF) Nets - Background

RBF  nets  belong  to  the  group  of  kernel  function  nets  that  utilize  simple  kernel  functions  whose  responses 
are  essentially  local  in  nature.  The  net  has  one  hidden  and  one  output  layer.  Each  hidden  node  is  a  kernel 
function.  An  output  node  computes  the  weighted  sum  of  the  hidden  node  outputs.  The  gaussian  function 
is a popular kernel function. The design and training of an RBF net consists of 1) determining how many
kernel functions to use, 2) finding their centers and widths, and 3) finding the weights that connect them to
an output node.

The following notation is used. An input pattern is given by the N-dimensional vector $x$, $x = (X_1, X_2, \ldots, X_N)$.
$K$ denotes the total number of classes. The method is for supervised learning where
the training set $x_1, x_2, \ldots, x_n$ is a set of sample patterns with known classification. The input $F_j(x)$ to the
$j$-th output node is given by $F_j(x) = \sum_{q=1}^{Q} h_{jq} G_q(x)$, with $G_q(x) = R(\| x - C_q \|, w_q)$. Here, $Q$ is the number
of hidden nodes, $G_q(x)$ is the response function of the $q$-th hidden node, $R$ is a radially asymmetric kernel
function, $C_q = (C_{q1} \ldots C_{qN})$ and $w_q = (w_{q1} \ldots w_{qN})$ are the center and widths of the $q$-th kernel function, and
$h_{jq}$ is the weight connecting the $q$-th hidden node to the $j$-th output node. There is one output node for each
class. Here, an asymmetric gaussian is chosen as the kernel function: $G_q(x) = \exp(-\sum_{i=1}^{N} (C_{qi} - X_i)^2 / w_{qi}^2)$.

Several  RBF  algorithms  have  been  proposed  recently.  Significant  contributions  include  those  by  Powell  [1987], 
Moody and Darken [1989], Broomhead and Lowe [1988], Musavi et al. [1992], Platt [1991] and others.

3.  Basic  Ideas  and  the  Algorithm 

The basic idea is to cover a class region with a set of gaussians. A function $F(x)$ is said to cover a certain
class region if it is slightly positive ($F(x) \ge \epsilon$) for patterns in that class and zero or negative for patterns
outside that class. Let $p_k$ gaussians be required to cover a certain class $k$. The covering function $F_k(x)$,
being a linear combination of $p_k$ gaussians, will be $F_k(x) = \sum_{q=1}^{p_k} h_q G_q(x)$, where $G_q(x)$, $q = 1, \ldots, p_k$, is
the $q$-th gaussian used to cover class $k$ and $h_q$ the corresponding connection weight. A pattern $x$, therefore,
will be in class $k$ if $F_k(x) \ge \epsilon_k$ and not in class $k$ if $F_k(x) \le 0$. Hence, the RBF net is modified to add
thresholding at the output node. Furthermore, when the effect of a gaussian unit is small, it is ignored. This
requires truncated gaussian units as follows: $G'_q(x) = G_q(x)$ if $G_q(x) > \phi$, and $G'_q(x) = 0$ otherwise, where $G'_q(x)$ is a
truncated gaussian function and $\phi$ a small constant. In experiments, $\phi$ was set to 10~^. Thus, the function
$F_k(x)$ is redefined as $F_k(x) = \sum_{q=1}^{p_k} h_q G'_q(x)$. Thus the modified RBF net has truncation at the hidden
nodes and thresholding at the output nodes.

In general, let $p_k$ be the number of gaussians required to cover class $k$, $k = 1, \ldots, K$, $F_k(x)$ be the
covering function (mask) and $G_1(x), \ldots, G_{p_k}(x)$ be the corresponding set of gaussians. Then a pattern $x'$
will belong to class $j$ if and only if its mask $F_j(x')$ is at least slightly positive, and the masks for all other
classes are zero or negative. Here, each mask $F_j(x)$ will have its own threshold value $\epsilon_j$ as determined during
its construction. In mathematical notation, a pattern $x'$ is in class $j$ if and only if $F_j(x') \ge \epsilon_j$ and $F_k(x') \le 0$
for all $k \ne j$, $k = 1, \ldots, K$. If all masks have values equal to or below zero, the input cannot be classified. If
masks from two or more classes have values above their $\epsilon$-thresholds, then also the input cannot be classified,
unless the maximum of the mask values is used to determine class.
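The decision rule above can be sketched as follows. The data layout (each mask as a list of `(weight, center, widths)` triples), the function names and the exact scaling inside the gaussian are assumptions made for illustration; `None` stands for "cannot be classified".

```python
import math

def gaussian(x, center, widths):
    """Gaussian with a separate width per direction (scaling is an assumption)."""
    return math.exp(-sum(((c - xi) / w) ** 2 for xi, c, w in zip(x, center, widths)))

def truncated(value, phi):
    """Truncated gaussian response: responses not above phi are ignored."""
    return value if value > phi else 0.0

def mask_value(x, units, phi):
    """Covering function F_k(x) built from (weight, center, widths) triples."""
    return sum(h * truncated(gaussian(x, c, w), phi) for h, c, w in units)

def classify(x, masks, eps, phi, use_max=False):
    """Return the class index of x, or None if x cannot be classified."""
    values = [mask_value(x, units, phi) for units in masks]
    above = [k for k, v in enumerate(values) if v >= eps[k]]
    if len(above) == 1:
        j = above[0]
        # All other masks must be zero or negative.
        if all(v <= 0.0 for k, v in enumerate(values) if k != j):
            return j
        return None
    if len(above) > 1 and use_max:
        return max(above, key=lambda k: values[k])
    return None

# Two one-dimensional classes, each covered by a single gaussian.
masks = [[(1.0, [0.0], [1.0])], [(1.0, [5.0], [1.0])]]
label = classify([0.2], masks, eps=[0.1, 0.1], phi=1e-3)
```

Truncation matters here: without it, the far-away gaussian of the other class would contribute a tiny positive value and the strict "zero or negative" condition would never hold.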

Let $TR_C$ be the set of pattern vectors of any class $k$ whose masking is desired and $TR_{NC}$ be the
corresponding set of non-class $k$ vectors, where $TR = TR_C \cup TR_{NC}$ is the total training set. Suppose
$m$ ($= p_k$) gaussians are available to cover the class. The following linear program is solved to determine the
$m$ weights $h = (h_1, \ldots, h_m)$ for the $m$ gaussians that minimize the classification error:

Minimize $\alpha \sum_{i \in TR_C} d_i + \beta \sum_{i \in TR_{NC}} d_i$    (1)

subject to

$F_k(x_i) + d_i \ge \epsilon_k$,  $i \in TR_C$,    (2)

$F_k(x_i) - d_i \le 0$,  $i \in TR_{NC}$,    (3)

$d_i \ge 0$,  $i \in TR$,    (4)

$\epsilon_k \ge$ a small positive constant, $h$ in $F_k(x)$ unrestricted in sign,    (5)

where the $d_i$'s are external deviation variables and $\alpha$ and $\beta$ are the weights for the in-class and out-of-class
deviations, respectively.

3.1  Generation  of  Gaussian  Units 

Here, gaussians are not purely local units. A variety of overlapping gaussians are created for masking.
Though both “fat” and “narrow” gaussians can be created, the “fat” ones are created first in an attempt
to generalize better using broad territorial features. Thus, the gaussians are generated incrementally and
as new gaussians are generated, the LP model (1)-(5) is solved for each class using all of the gaussians for
that class and the resulting network evaluated. Whenever the incremental change in the error rate (training and
testing) becomes small or overfitting occurs on the training set, masking of a class is complete.

The gaussians for a given class $k$ at any stage $h$ are randomly selected in the following way. First, a
majority criterion is specified for that stage. Denote this majority criterion by $\theta_h^k$ for the $k$-th class at stage
$h$. A $\theta_h^k$ of 60% means that at least 60% of the patterns covered by a gaussian's core (within one standard
deviation) should belong to class $k$. $\theta_h^k$ starts at 50% and can go up to 100%. To generate a gaussian,
randomly select a pattern $x_g$ of class $k$ from the training set and search for all other patterns in some
$\delta$-neighborhood of $x_g$. This $\delta$-neighborhood is actually an ellipsoid defined by different widths in different
directions. Let $V_i$ be the set of patterns within the $\delta$-neighborhood of the $i$-th such initial vector $x_g^i$. If the set
$V_i$ satisfies the majority criterion, it can be used to create a gaussian. If a gaussian is created from $V_i$, the
centroid of its pattern vectors becomes the center $C_i$ and the standard deviation of their distances from $C_i$
in direction $j$ becomes $w_{ij}$, the width in direction $j$. Whether a gaussian is defined or not, the patterns in
$V_i$ are removed from the training set. This process of randomly picking a pattern $x_g$ of class $k$ from the
remaining training set and searching for patterns in its $\delta$-neighborhood is then repeated. A set of gaussians
($j = 1, \ldots, Q_h$) can be found at the $h$-th stage by repeating this process until the remaining training set is
empty of class $k$ vectors.

The procedure described above is embellished slightly whereby the $\delta$-neighborhood is allowed to grow
until it reaches a certain maximum size of $\delta_{max}$ or until it no longer satisfies the majority criterion. Let $\delta_r$ be
the width of the elliptical $\delta$-neighborhood at growth stage $r$, $\delta_r = (\delta_{r1}, \ldots, \delta_{rN})$. So the process of generating
a gaussian starts with an initial $\delta$, $\delta_0$, and then increases $\delta$ by a fixed increment ($\delta_r = \delta_{r-1} + \Delta\delta$) where
$\Delta\delta = (\Delta\delta_1, \ldots, \Delta\delta_N)$. So, in this embellished method, the gaussians at any stage $h$ are randomly selected
as follows. Randomly select a pattern $x_g$ of class $k$ from the remaining training set and search for all other
patterns in the $\delta_0$-neighborhood of $x_g$. Let $V_i^r$ be the set of patterns within the $\delta_r$-neighborhood of the $i$-th such
initial vector $x_g^i$. A neighborhood can be grown only if the current pattern set $V_i^r$ from the $\delta_r$-neighborhood
satisfies the majority criterion and if $\delta_r \le \delta_{max}$. If the current set $V_i^r$ fails the majority criterion, the
previous set $V_i^{r-1}$ is used to create a gaussian. The center and widths of a gaussian are determined in the
same manner as before. Once a gaussian is defined from a pattern set ($V_i^r$ or $V_i^{r-1}$), that set is removed
from the remaining training set. This process of randomly picking a pattern vector $x_g$ of class $k$ from
the remaining training set and growing the largest possible gaussian around it is then repeated. A set of
gaussians ($j = 1, \ldots, Q_h$) for class $k$ is found at the $h$-th stage by repeating this process until the remaining
training set is empty of class $k$ patterns.
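Growing one neighborhood and turning it into a gaussian can be sketched as below. This covers only a single seed pattern: the bookkeeping that removes patterns from the remaining training set and the minimum-size check are omitted, the ellipsoid test and the use of the population standard deviation for the widths are assumptions, and all names are hypothetical.

```python
import math

def in_neighborhood(x, seed, delta):
    """Axis-aligned ellipsoid test: sum_i ((x_i - seed_i) / delta_i)^2 <= 1."""
    return sum(((a - s) / d) ** 2 for a, s, d in zip(x, seed, delta)) <= 1.0

def grow_neighborhood(seed, patterns, labels, k, theta, step, delta_max):
    """Return indices of the largest neighborhood of `seed` that meets the
    majority criterion theta for class k, or [] if even the first one fails."""
    accepted = []
    r = 1
    while all(r * s <= m for s, m in zip(step, delta_max)):
        delta = [r * s for s in step]            # delta_r = r * (fixed increment)
        members = [i for i, x in enumerate(patterns)
                   if in_neighborhood(x, seed, delta)]
        if not members:
            break
        share = sum(1 for i in members if labels[i] == k) / len(members)
        if share < theta:                        # criterion fails: keep previous set
            break
        accepted = members
        r += 1
    return accepted

def gaussian_from(patterns, members):
    """Center = centroid of the set; widths = per-direction standard deviations."""
    n = len(members)
    dims = len(patterns[0])
    center = [sum(patterns[i][d] for i in members) / n for d in range(dims)]
    widths = [math.sqrt(sum((patterns[i][d] - center[d]) ** 2 for i in members) / n)
              for d in range(dims)]
    return center, widths

patterns = [[0.0], [0.1], [0.2], [0.5], [0.6]]
labels = [1, 1, 1, 2, 2]
members = grow_neighborhood([0.1], patterns, labels, k=1, theta=0.8,
                            step=[0.15], delta_max=[1.0])
center, widths = gaussian_from(patterns, members)
```

In this toy run the neighborhood stops growing once a class-2 pattern enters and pushes the class-1 share below the 80% criterion, so the gaussian is built from the three class-1 patterns.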

3.2  The  Algorithm 

The algorithm is summarized below. The following notation is used. $I$ and $R$ denote the initial and
remaining training sets, respectively. $\delta_{max}$ is the maximum neighborhood radius and $\delta_r$ is the neighborhood
width at the $r$-th step of neighborhood growth. $\Delta\delta$ is the $\delta_r$ increment at each step. $V_i^r$ is the set of patterns
within the $\delta_r$-neighborhood of the $i$-th initial vector $x_g^i$. $P_i^r(k)$ denotes the percentage of class $k$ members in $V_i^r$.
$N_i^r$ denotes the number of vectors in $V_i^r$. $h$ is the stage counter, $\theta_h$ is the minimum percentage of class $k$
members in $V_i^r$ in stage $h$ and $\Delta\theta$ is the increment for $\theta_h$ at each stage. $Q_j$ corresponds to the number of
gaussians created in stage $j$ and $p_k = \sum_{j=1}^{h} Q_j$ is the number of gaussians till stage $h$. $C_{ki}$ and $w_{ki}$ are
the center and widths, respectively, of the $i$-th gaussian unit of class $k$. $TRE_h$ and $TSE_h$ are the training and
testing set errors, respectively, at the $h$-th stage for class $k$. $\beta$ is the minimum number of points required to
form a gaussian and $\sigma = (\sigma_1, \ldots, \sigma_N)$ the standard deviations of the distances from the centroid in various
directions for class $k$. $\delta_{max}$ is set to some multiple of $\sigma$. The fixed growth step in each direction $i$, $\Delta\delta_i$, is
set to $\sigma_i/L$, where $L$ is the desired number of growth steps. $L$ was set to 25 for computational experiments.

The  Gaussian  Masking  (GM)  Algorithm 

(1) Initialize class counter: $k = 0$.

(2) Increment class counter: $k = k + 1$. If $k > K$, stop. Else, initialize gaussian counter: $j = 0$.

(3) Initialize stage counter and constants: $h = 0$, $\delta_{max}$ = some multiple of $\sigma$, $\Delta\delta_i = \sigma_i/L$, $\Delta\theta$ = some constant (e.g. 10%).

(4) Increment stage counter: $h = h + 1$. Increase majority criterion: if $h > 1$, $\theta_h = \theta_{h-1} + \Delta\theta$; otherwise
$\theta_h$ = 50%. If $\theta_h$ > 100%, go to (2).

(5) Select gaussian units for the $h$-th stage: $i = 0$, $R = I$, $Q_h = 0$.

(a) Set $i = i + 1$, $r = 1$, $\delta_r = \Delta\delta$.

(b) Select an input pattern vector $x_g^i$ of class $k$ at random from $R$, the remaining training set.

(c) Search for all pattern vectors in $R$ within a $\delta_r$-neighborhood of $x_g^i$. Let this set of vectors be $V_i^r$.
(i) If $P_i^r(k) < \theta_h$ and $r > 1$, set $r = r - 1$, go to (e); (ii) if $P_i^r(k) \ge \theta_h$ and $r > 1$, go to (d) to expand
neighborhood; (iii) if $P_i^r(k) < \theta_h$ and $r = 1$, go to (h); (iv) if $P_i^r(k) \ge \theta_h$ and $r = 1$, go to (d).

(d) Set $r = r + 1$, $\delta_r = \delta_{r-1} + \Delta\delta$. If $\delta_r > \delta_{max}$, set $r = r - 1$, go to (e). Else, go to (c).

(e) Remove the set $V_i^r$ from $R$: $R = R - V_i^r$. If $N_i^r < \beta$, go to (g).

(f) Set $j = j + 1$. Compute the center $C_{kj}$ and widths $w_{kj}$ of the $j$-th gaussian for class $k$. $Q_h = Q_h + 1$.
$C_{kj}$ = centroid of the set $V_i^r$, and
$w_{kj}$ = standard deviations of the distances from their centroid of the patterns in $V_i^r$.

(g) If $R$ is not empty, go to (a), else go to (6).

(h) Remove the set $V_i^1$ from $R$: $R = R - V_i^1$. If $R$ is not empty, go to (a), else go to (6).

(6) From the set $p_k$, eliminate similar gaussians (i.e. those with very close centers and widths). Let $p'_k$ be
the new set of gaussians after this elimination.

(7) Solve LP (1)-(5) for class $k$ mask using $p'_k$ number of gaussians.

(8) Compute $TSE_h$ and $TRE_h$ for class $k$. (a) If $TSE_h < TSE_{h-1}$, go to (4); (b) If $TSE_h > TSE_{h-1}$ and
$TRE_h > TRE_{h-1}$, go to (4); (c) Otherwise, overfitting has occurred. Use the mask generated in the
previous stage as class $k$ mask. Go to (2) to mask next class.

Other  stopping  criteria,  like  maximum  number  of  gaussians  used  or  incremental  change  in  TSE,  can  also  be 
used.  Polynomial  time  convergence  of  the  GM  algorithm  can  be  proven  in  a  manner  similar  to  Roy  et  al. 
[1993]. 

4.  Computational  Results 

All  problems  were  solved  on  a  SUN  Sparc2  workstation.  Linear  programs  were  solved  using  Roy 
Marsten’s  OBI  interior  point  code  from  Georgia  Institute  of  Technology.  The  dual  log  barrier  penalty 
method of OBI was used. The weights in LP (1)-(5) were set to 1 in all cases.

Several well-known problems were solved using this method. Only a few results are reported here. A
2-class, 2-dimensional problem, described in Musavi et al. [1992], was solved where the first class has a zero
mean vector with identity covariance matrix and the second class has a mean [1,2] and a diagonal covariance
matrix with values 0.01 and 4. The estimated optimal error rate is 6%. The GM algorithm obtained an
error rate of 8.75% using only 11 gaussians (up to 80% majority gaussians). Musavi et al. [1992] achieved
an error rate of 9.26% with 86 gaussians. The GM algorithm used 200 points for training.

Mangasarian et al. [1990] describe a breast cancer diagnosis problem. The data, from University of Wisconsin Hospitals,
contain 608 cases, each case having 9 measurements with values between 1 and 10. Of the 608 cases, 379
were benign and the rest malignant. Of the 608, 405 were used for training and the rest were used for testing. An
error rate of 4.43% was obtained by the GM algorithm using 7 gaussians. Mangasarian et al. [1990] report
average error rates of 2.56% and 6.10% with their MSMl and MSM methods respectively.

Abstract Some further results are proposed for the L2 convergence rates of Kernel Regression Estimators
(KRE) and Radial Basis Function (RBF) nets given in Xu, Krzyzak & Yuille (1992 & 93). Instead of studying the
convergence properties of the L2 error, here we study the convergence properties of the MISE (Mean Integrated
Square Error). It will be shown that the upper bounds for the convergence rate of MISE are tighter than those for the
convergence rate of the L2 error given in Xu, Krzyzak & Yuille (1992 & 93), under milder conditions.

1.  Introduction 

A number of theoretical results on Radial Basis Function (RBF) networks have been obtained; see Xu, Krzyzak
& Yuille (1993) for a long list of references. It has been shown that the RBF net can be naturally derived from
regularization theory (Poggio & Girosi, 1989; Yuille & Grzywacz, 1989), and that RBF nets have the universal
approximation ability (Hartman, Keeler & Kowalski, 1989; Park & Sandberg, 1991 & 1993) as well as the so-called
best approximation ability (Girosi & Poggio, 1989). In addition, RBF nets can also be related to Parzen window
estimators of probability density (it can be considered a special example of an RBF net) and probabilistic neural
networks (Specht, 1990), which are based on the Parzen window estimator.

Recent theoretical studies on RBF nets gave convergence rates of approximation and generalization error in terms
of the size of RBF nets (i.e., the number of basis functions) (Xu, Krzyzak & Yuille, 1992 & 93; Girosi & Anzellotti,
1992; Corradi & White, 1992). In Xu, Krzyzak & Yuille (1992 & 93), the connection between RBF nets and
the Kernel Regression Estimator (KRE) has been established. It has been shown that KRE can be regarded as a
particular kind of RBF net. Using theoretical results about KRE, a number of interesting theoretical results for
RBF nets have been obtained. First, upper bounds have been given for the pointwise and L2 convergence rates of the
approximation error with respect to the number n of basis functions; explicit bounds, depending on the dimension d
of x, are given there for the L2 convergence rate on approximating a function R(x) in the Lipschitz function class,
and for the L2 convergence rate on approximating a function R(x) in the class of functions which have order-q (q > 1)
square integrable derivatives. Second, the learnability of RBF nets has been proved by showing the
existence of a consistent estimator for RBF nets constructively. Third, upper bounds have also been provided for
the pointwise and L2 convergence rates of the best consistent estimator for RBF nets as n and N (the number of
learning samples, N ≥ n) tend to ∞, again with explicit L2 rates for the two function classes described above.
Fourth, some theoretical results on selecting
the appropriate size of the receptive field of the radial basis function have been provided too. At nearly the same
time, Girosi & Anzellotti (1992) and Corradi & White (1992) also proposed some results on RBF net convergence
rates of the approximation error. However, their studies differ from Xu, Krzyzak & Yuille (1992 & 93) in several
aspects: (1) the unnormalized RBF net has been considered instead of the normalized RBF nets considered by Xu,
Krzyzak & Yuille (1992 & 93); (2) the tools used were totally different; (3) the results of Girosi & Anzellotti (1992)
and Corradi & White (1992) concern only the convergence rate of the approximation error, while Xu, Krzyzak & Yuille
(1992 & 93) study much more than just approximation error; (4) the conditions assumed and the detailed results are
also different, though the rates obtained in Xu, Krzyzak & Yuille (1992 & 93), Girosi & Anzellotti (1992), and Corradi
& White (1992) are consistent.

This paper proposes some further results complementary to those obtained in Xu, Krzyzak & Yuille (1992 & 93)
on the L2 convergence rates of KRE and RBF nets. Given a vector-valued regression function
$R(x) = [r^{(1)}(x), \ldots, r^{(m)}(x)]^T$, let a network with output estimate $f_{n,N}(x)$ be trained by a training set
$D_N = \{X_i, Y_i\}_1^N$, where N is the number of training samples and n is the size of the network, e.g., the
number of hidden neurons in the network.
In Xu, Krzyzak & Yuille (1992 & 93), we have studied the convergence properties of the L2 error

$\int_U |R(x) - f_{n,N}(x)|^2 \, d\mu(x)$, where $|e(x)| = \sqrt{e(x)^T e(x)}$, U is the domain of x, and $\mu$

denotes the measure on x. This error is a random variable because the training samples are random variables. Thus,
the convergence properties given in Xu, Krzyzak & Yuille (1992 & 93) are described in terms of 'almost surely' or 'in
probability'. In this paper, we study the convergence properties of the MISE error:

$\epsilon^2 = E_{D_N}[(R(X_i) - f_{n,N}(X_i))^2]$, (1)

where $X_i \in D_N$. This error is not a random variable. We will show that the upper bounds for the convergence rate
of $\epsilon^2$ are tighter than those for the L2 convergence rate given in Xu, Krzyzak & Yuille (1992 & 93), under milder
conditions.

2.  RBF  Net,  KRE  and  Convergence  Properties 

As in Xu, Krzyzak & Yuille (1992 & 93), we consider the RBF nets of the following normalized version (Moody
& Darken, 1989; Nowlan, 1990; Jones et al., 1991):

$f_n(x) = \frac{\sum_{i=1}^{n} w_i \, \phi([x - c_i]^T \Sigma^{-1} [x - c_i])}{\sum_{i=1}^{n} \phi([x - c_i]^T \Sigma^{-1} [x - c_i])}$, (2)

where $\phi(r^2)$ is a prespecified basis function satisfying certain weak conditions. The most common choice is the
Gaussian function, $\phi(r^2) = e^{-r^2}$ with $\Sigma = \sigma(n)^2 I$, but a number of alternatives can also be used (e.g., several
choices are listed in Poggio & Girosi, 1989). $c_i$ is called the center vector and $w_i \in R^m$ is a weight vector. $\Sigma$ is a
$d \times d$ positive matrix which controls the receptive field of the basis functions $\phi([x - c_i]^T \Sigma^{-1} [x - c_i])$.
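
As an illustration of eq.(2), the following is a minimal sketch of a normalized RBF net with the Gaussian basis function and $\Sigma = \sigma^2 I$. The function name `normalized_rbf`, its argument layout, and the two-center example are assumptions made for this sketch and do not come from the paper.

```python
import numpy as np

def normalized_rbf(x, centers, weights, sigma):
    """Normalized RBF net of eq.(2), assuming phi(r^2) = exp(-r^2)
    and Sigma = sigma^2 * I (hyper-spherical receptive field)."""
    x = np.asarray(x, dtype=float)               # input, shape (d,)
    centers = np.asarray(centers, dtype=float)   # c_i, shape (n, d)
    weights = np.asarray(weights, dtype=float)   # w_i, shape (n, m)
    diff = centers - x
    # [x - c_i]^T Sigma^{-1} [x - c_i] for every center
    r2 = np.sum(diff * diff, axis=1) / sigma ** 2
    phi = np.exp(-r2)
    # Weighted average of the w_i, normalized by the total activation
    return phi @ weights / phi.sum()

# Hypothetical example: two centers with scalar weights 1 and 3
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([[1.0], [3.0]])
y_mid = normalized_rbf([0.5, 0.5], centers, weights, sigma=1.0)
y_near_first = normalized_rbf([0.0, 0.0], centers, weights, sigma=1.0)
```

Because of the normalization, the output is always a convex combination of the weight vectors. For inputs very far from every center the denominator can underflow, so a practical version would guard against division by zero.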

Xu, Krzyzak & Yuille (1992 & 93) have shown that this type of RBF net has close connections to the Kernel
Regression Estimator (KRE) studied in the statistical literature.

Let $(X, Y)$ be a pair of random vectors in $R^d \times R^m$ and $R(x) = E(Y | X = x)$ be the corresponding regression
function. Let $\mu$ denote the probability measure of X. Moreover, let $D_n = \{X_i, Y_i\}_1^n$ be a set of independent identically
distributed samples drawn from $(X, Y)$. The kernel regression estimate of $R(x)$ is defined as follows:

$g_n(x) = g_n(x, D_n) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h_n}\right)}$, (3)

which is the weighted average of the $Y_i$ for approximating the conditional mean of Y under a given X = x, with weights
depending nonlinearly on the $X_i$. Here, $h_n$ is usually called a smoothness parameter and is a positive number that
depends on the number of samples n. $K \ge 0$ is a $\mu$-integrable kernel on $R^d$. When the kernel $K(x)$ is spherically
symmetric, we can rewrite eq.(3) as:

$g_n(x) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{\|x - X_i\|^2}{h_n^2}\right)}{\sum_{i=1}^{n} K\left(\frac{\|x - X_i\|^2}{h_n^2}\right)}$. (4)


It is not difficult to see that by letting

$K(r^2) = \phi(r^2)$, $\Sigma = h_n^2 I$, $\sigma(n)^2 = h_n^2$, and $w_i = Y_i$, $c_i = X_i$, $i = 1, \ldots, n$, (5)

eq.(4) is identical to eq.(2). That is, a spherically symmetrical kernel $K(r^2)$ is just a special instance of the radial basis
function model eq.(2) with a hyper-spherically shaped receptive field specified by the matrix $\Sigma = h_n^2 I$, and with the
weight vectors $w_i$, $i = 1, \ldots, n$ being simply assigned the specified values $Y_i$, $i = 1, \ldots, n$. It is interesting to notice
that the assumption of hyper-spherically shaped receptive fields is commonly used in the existing studies of RBF nets
(Broomhead & Lowe, 1988; Chen, Cowan & Grant, 1991; Moody & Darken, 1989; Poggio & Girosi, 1989; Xu, Klasa
& Yuille, 1992).
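
The identification in eq.(5) can be checked numerically. The sketch below assumes a Gaussian kernel $K(r^2) = e^{-r^2}$ and synthetic data; the function names `kre` and `normalized_rbf`, the sample size, and the target function are illustrative assumptions, not part of the paper.

```python
import numpy as np

def kre(x, X, Y, h):
    """Kernel regression estimate g_n(x) in the spherically symmetric
    form, with an assumed Gaussian kernel K(r^2) = exp(-r^2)."""
    r2 = np.sum((X - x) ** 2, axis=1) / h ** 2
    k = np.exp(-r2)
    return k @ Y / k.sum()

def normalized_rbf(x, centers, weights, sigma):
    """Normalized RBF net with phi(r^2) = exp(-r^2), Sigma = sigma^2 I."""
    r2 = np.sum((centers - x) ** 2, axis=1) / sigma ** 2
    phi = np.exp(-r2)
    return phi @ weights / phi.sum()

# Synthetic samples (hypothetical): X in R^2, Y in R^1
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
Y = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((20, 1))
x0 = np.array([0.2, -0.3])

g = kre(x0, X, Y, h=0.5)
# Substitution of eq.(5): c_i = X_i, w_i = Y_i, sigma = h
f = normalized_rbf(x0, centers=X, weights=Y, sigma=0.5)
```

With the substitution $c_i = X_i$, $w_i = Y_i$, $\sigma = h$, the two functions evaluate the same expression, so `g` and `f` agree to floating-point precision.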

With this connection, we can obtain upper bounds for the convergence rates of RBF nets via the convergence
rates of KRE. To make our statements more precise, we first review some mathematical terms.

Given a vector-valued function $f(x) = [f^{(1)}(x), \ldots, f^{(m)}(x)]^T$ and a sequence $\{f_n(x)\}_{n=1}^{\infty}$, let

$e_n(f, f_n) = |f(x) - f_n(x)|$, $\rho_U(f, f_n) = \int_U |e_n(f, f_n)|^2 \, d\mu(x)$, (6)

where $|e(x)| = \sqrt{e(x)^T e(x)}$ for $e(x) = [e^{(1)}(x), \ldots, e^{(m)}(x)]^T$, U is the domain of x, and $\mu$ denotes the measure on
x. For any $\epsilon > 0$, if there exists a specific $n_0$ such that $\rho_U(f, f_n) < \epsilon$ for any $n > n_0$, then $f_n$ is said to converge
in L2 to f. Given a positive sequence $\xi_n$ which tends to zero as $n \to \infty$, the convergence rate of $\xi_n$ is said to be of
$O(r(n))$ if there is an explicit positive function $r(n)$ of n with $r(n) \to 0$ as $n \to \infty$ (e.g., $r(n) = n^{-q}$, $q > 0$) such
that $a_n \xi_n / r(n) \to 0$ as $n \to \infty$ for any sequence of positive numbers $\{a_n\}$ which satisfies $a_n \to 0$ as $n \to \infty$. Using
$\rho_U(f, f_n)$ to replace $\xi_n$, we get the definition of the L2 convergence rate.

A function approximation scheme is a device of a set $\mathcal{F}$ of functions supported on $R^d$. Usually, this device consists
of a number of components so that the set $\mathcal{F}$ can be characterized by this number (say n), that is, we can denote it
by $\mathcal{F}_n$. Examples of such devices are multilayer networks with n hidden sigmoid units and RBF nets with n radial
basis functions. Let $\mathcal{F}_\infty = \bigcup_{n=1}^{\infty} \mathcal{F}_n$; then the function approximation scheme is said to have the property of universal
approximator (Hornik, Stinchcombe & White, 1989) if $\mathcal{F}_\infty$ is dense in the space of the continuous functions $C[U]$
defined on some domain U of $R^d$; or in other words, if for any continuous function $f(x)$ supported on U, there exists
a specific $f_n \in \mathcal{F}_n$ such that $f_n(x)$ converges to $f(x)$ uniformly. Similarly, for any function $f(x)$ of a given function
class $\mathcal{F}_c(U)$ supported on U, if there exists a specific $f_n \in \mathcal{F}_n$ such that $f_n(x)$ converges to $f(x)$ in the L2 sense, we
say that the function approximation scheme has the property of L2 approximation for the function class $\mathcal{F}_c(U)$.

These properties describe the approximation ability of one set of functions to another set of functions. For a given
function $f(x)$, the properties only say that there exists, in the set $\mathcal{F}_n$ defined by the function approximation scheme,
a function that can approximate $f(x)$ well as $n \to \infty$. They say nothing about how to find such a function. Usually,
$\mathcal{F}_n$ is characterized by a set $\theta$ of unspecified parameters. Each specified value $\hat\theta$ of $\theta$ determines an $f_n(x)$ in $\mathcal{F}_n$.
The value $\hat\theta$ (thus $f_n(x)$) is obtained based on a set of observed samples $D_N = \{X_i, Y_i\}_1^N$ of a given function $f(x)$.
Usually these observed samples $X_1, Y_1, \ldots, X_N, Y_N$ are independent and identically distributed random variables with $f(x)$ being
their regression function, i.e., $f(X_i) = R(X_i) = E(Y_i | X_i)$. Such an $f_n(x)$ is called an estimator of $R(x)$. To explicitly
indicate its dependence on $D_N$, we denote it by $f_{n,N}(x)$. An example of such estimators is KRE, a specific RBF net
obtained via eq.(5). Since the $D_N$ are random samples, $f_{n,N}(x)$ is also a random variable. In this case, the convergence
behavior is described by a property called statistical consistency, which describes how the estimator approaches the
regression function $R(x) = E(Y | X = x)$ as the number of samples tends to infinity. An estimator $f_{n,N}(x)$ is said to
be L2 consistent in probability if it converges in the L2 sense to $R(x)$ in probability.

Although useful, the convergence properties of universal approximation, L2 approximation and statistical
consistency give no description of the rates at which $f_n(x)$, $f_{n,N}(x)$ converge to $f(x)$ with respect to the numbers n, N, i.e.,
the size of the hidden layer and the size of the training sequence. Since KRE is a particular specified RBF net and its
$g_n(x)$ belongs to the set $\mathcal{F}_n$ that is defined by the RBF net eq.(2), we can explore the convergence rates of RBF
nets through investigating the convergence rates of KRE. Following this line of thought, Xu, Krzyzak & Yuille
(1992 & 93) obtained a number of results on the convergence rates of RBF nets.

This paper will provide some complementary results. We further study the convergence properties of the following
MISE error

$\epsilon^2 = E_{D_N}[(R(X_i) - g_n(X_i))^2]$, for KRE,

$\epsilon^2 = E_{D_N}[(R(X_i) - f_{n,N}(X_i))^2]$, for RBF,

where $D_N$ is the training set. By taking expectation on $\frac{1}{N} \sum_{i=1}^{N} (R(X_i) - f_{n,N}(X_i))^2$, we see that it becomes
$E_{D_N}(R(X_i) - f_{n,N}(X_i))^2$ when $X_1, \ldots, X_N$ are i.i.d. random variables. In other words, this error can also be
regarded as the estimation error of the networks on the training set $D_N$.
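
To make the MISE on the training set concrete, the following sketch estimates $E_{D_N}[(R(X_i) - g_n(X_i))^2]$ for KRE by Monte Carlo simulation. It assumes a box kernel $K(r^2) = 1$ for $r^2 \le 1$ and 0 otherwise, uniformly distributed inputs on $[0,1]^d$, and Gaussian noise; the function names, the target function `R_true`, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

def kre_on_sample(X, Y, h):
    """Evaluate the kernel regression estimate at every training point
    X_i, using an assumed box kernel K(r^2) = 1 if r^2 <= 1 else 0."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / h ** 2
    K = (d2 <= 1.0).astype(float)     # each point is in its own window
    return K @ Y / K.sum(axis=1, keepdims=True)

def mise_estimate(R, n, h, d=1, noise=0.1, trials=200, seed=0):
    """Monte Carlo estimate of E[(R(X_i) - g_n(X_i))^2] over random
    training sets of size n drawn uniformly from [0, 1]^d."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.uniform(0.0, 1.0, size=(n, d))
        Y = R(X) + noise * rng.standard_normal((n, 1))
        g = kre_on_sample(X, Y, h)
        errs.append(np.mean((R(X) - g) ** 2))
    return float(np.mean(errs))

# Hypothetical regression function on [0, 1]
R_true = lambda X: np.sin(2.0 * np.pi * X[:, :1])
mise_small = mise_estimate(R_true, n=50, h=0.2)
# Degenerate check: no noise and a window so small that only X_i itself
# falls inside it, so g_n(X_i) = Y_i = R(X_i) and the error vanishes.
mise_zero = mise_estimate(R_true, n=10, h=1e-9, noise=0.0, trials=5)
```

Repeating the first call for growing n with a bandwidth h that shrinks slowly is one way to observe the kind of decay the convergence theorems describe, although a simulation of this kind only illustrates the behavior and does not verify the stated rates.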

3.  Main  results 

Theorem 1 (KRE's convergence)

Let $EY^2 < \infty$, and

$c_1 I_{S_{0,r}} \le K(x) \le c_2 I_{S_{0,R}}$, $0 < r < R < \infty$, $c_1, c_2 > 0$,

$h \to 0$, $nh^d \to \infty$,

where $I_A$ denotes the indicator of set A and $S_{x,r} = \{y : \|y - x\| \le r\}$. For $g_n(x)$ given by eq.(4), we have

$E(R(X_i) - g_n(X_i))^2 \to 0$ as $n \to \infty$.

The above theorem shows that the KRE estimator $g_n(x)$ given by eq.(4) converges in the MISE sense. In comparison
with Theorem 2 of Xu, Krzyzak & Yuille (1992 & 93), we can observe that the condition $E|Y|^{2+s} < \infty$, $s > 0$ has
been relaxed into $EY^2 < \infty$. The following theorem gives the MISE convergence rate of the KRE estimator $g_n(x)$ given
by eq.(4).

Theorem 2 (KRE's convergence rate)

Let $\mu$ denote the probability measure of X with a compact support, and

$c_1 I_{S_{0,r}} \le K(x) \le c_2 I_{S_{0,R}}$, $0 < r < R < \infty$, $c_1, c_2 > 0$,

$h \to 0$, $nh^d \to \infty$ as $n \to \infty$,

$E|Y|^{2+s} < \infty$, $s > 0$,

$|R(x) - R(y)| \le \beta \|x - y\|^{\alpha}$, $0 < \alpha \le 1$, $\beta > 0$.

For $g_n(x)$ given by eq.(4), we have

E(R(Xi)  -  y„(X,))*  =  O(n"r>+.?f>'<.+0). 


Let'f  =  2  +  s  and  put  it  into  Theorem  4,  then  the  condition  <00  s  >  0  becomes  E|i'|'  <00  t  >  2. 

Correspondingly,  the  rate/becomes  O(o  '<»•+') ).  Now  we  are  ready  to  compare  with  the  rate  0(n“  •  ~  (»«+<) ) — the 
one  given  in  Theorem  6(B)  of  Xu,  Krzyzak,  Yuille  (1992  k  93).  Let  f  =  s  and  noticing  that  <  j,  we  have 

>  (3a4j)  ~  That  is,  the  rate  given  in  Theorem  2  is  is  tighter  than  the  one  given  in  Theorem  6(B)  of  Xu, 
Krzyzak, Yuille (1992 & 93). In addition, comparing the requirement $E|Y|^t < \infty$, $t > 2$ with $E|Y|^t < \infty$, $t > 2+f$,
we see that the condition has also been relaxed.

From the above theorems about the KRE estimator $g_n(x)$ given by eq.(4), we can observe that for a given
function $f(x) = R(x)$ we can construct a specific RBF net $f_{n,n} \in \mathcal{F}_n$ by simply letting the parameters $\theta$ assume the
values provided by the samples $D_n = \{X_i, Y_i\}_1^n$, with $D_n$ being a randomly selected subset of $D_N$, in the same way
as it was done for the KRE eq.(4). As a result, this $f_{n,n}$ will converge in MISE to $f(x)$ with the same convergence
properties as described by Theorems 1-3. Recall that such a specific $f_{n,n}$ may not be the best $f_{n,N}$, $N \ge n$, so
the convergence rate of the best $f_{n,N}$, $N \ge n$ given by the RBF net will be no worse than the rate provided by Theorem
3. That is, we can get upper bounds for the convergence rates of $f_{n,N}$, $N \ge n$. These bounds are described more
precisely in the following theorems.

Theorem 3 (RBF's convergence)

Let $EY^2 < \infty$, and

$c_1 I_{S_{0,r}} \le \phi(x) \le c_2 I_{S_{0,R}}$, $0 < r < R < \infty$, $c_1, c_2 > 0$,
$h \to 0$, $nh^d \to \infty$.

Let $f^*_{n,N}(x)$ denote the one in $\mathcal{F}_{n,N}$ that approximates $R(x)$ best. We have

$E(R(X_i) - f^*_{n,N}(X_i))^2 \to 0$ as $n \to \infty$,

where $\mathcal{F}_{n,N}$ denotes the set of functions defined by the RBF nets eq.(2) and trained by the training set of N samples.
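
Theorem 4 (RBF's convergence rate)

Let $\mu$ denote the probability measure of X with a compact support, and

$c_1 I_{S_{0,r}} \le \phi(x) \le c_2 I_{S_{0,R}}$, $0 < r < R < \infty$, $c_1, c_2 > 0$,
$h \to 0$, $nh^d \to \infty$ as $n \to \infty$,

$E|Y|^{2+s} < \infty$, $s > 0$,

$|R(x) - R(y)| \le \beta \|x - y\|^{\alpha}$, $0 < \alpha \le 1$, $\beta > 0$.

Let $f^*_{n,N}(x)$ denote the one in $\mathcal{F}_{n,N}$ that approximates $R(x)$ best. We have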

E(R(A,)  -  /:,M)y  = 

Abstract General nonlinear mixed effects models for repeated measurement data have a wide range of
applications in medicine, pharmacokinetics, business and economics, but may have a complicated covariance
structure. Traditional neural network models for nonlinear regression cannot be directly applied to
nonlinear regression with fixed and random effects. In this paper, we propose a novel neural network model
with a weighted sum of squared errors. An iterative algorithm that alternates between estimation of the fixed effects
parameters and the variance components is presented. A modified backpropagation scheme is investigated with encouraging
simulation results.


1  Introduction 

There has been increasing interest in general nonlinear mixed effects models for repeated
measures data, in which data are generated on individuals over time or under fixed conditions. Regression
with fixed and random effects plays an important role in biomedical research, including pharmacokinetics,
bioassay, and clinical trials, and in business and economics (Vonesh and Carter, 1992). Since individuals are assumed
to constitute a random sample from a population of interest, the observed data have nonconstant correlation
among them. Traditional neural network models for nonlinear regression cannot be directly applied to
nonlinear regression with mixed effects (Blum, 1992; Freeman, 1994).

In  this  paper,  we  propose  a  novel  neural  network  model  for  nonlinear  regression  with  mixed  effects. 
The  model  allows  for  incomplete  or  unbalanced  data  and  a  variance-covariance  structure.  The  response 
is  expressed  as  a  Sum  of  nonlinear  functions  of  fixed  effects  and  linear  functions  of  random  effects.  The 
objective  function  is  a  weighted  sum  of  square  errors.  The  estimates  of  unknown  population  parameters  are 
obtained  by  an  iterative  algorithm.  A  modified  backpropagation  scheme  involving  weight  matrix  imposed 
on  sum  of  square  errors  is  presented.  Finally,  the  model  is  applied  to  the  growth  of  orange  trees  over  time. 

2  The  Model 

It is assumed that there are P distinct individuals (p = 1, ..., P) whose responses can be expressed as
the following general nonlinear mixed effects model:

$Y_p = f(X_p, W) + Z_p \beta_p + \epsilon_p$, $p = 1, \ldots, P$, (1)

where $Y_p = [y_{p1}, \ldots, y_{p r_p}]^T$ is an $r_p \times 1$ vector of repeated measurements from the p-th individual;
$X_p = [X_{p1}, \ldots, X_{p r_p}]^T$ is an $r_p \times k$ matrix of known explanatory variables; W is a vector of unknown parameters;
$f(X_p, W)$ is a nonlinear response function; $Z_p$ is an $r_p \times m$ full-rank matrix of known constants; $\beta_p$ is an
m-vector of unobserved random regression coefficients with mean zero and covariance matrix $\Psi$; and $\epsilon_p$ is the
$r_p \times 1$ random error vector with mean zero and variance matrix $\sigma^2 I_{r_p}$.

The nonlinear function $f(X_p, W)$ can be approximated by a neural network. The bias units always have
an output of one and they are connected to all units on their respective layer. Units on all layers calculate
their net-input values. For the hidden-layer units:

$net_j^h = \sum_{i=1}^{k} w_{ji}^h x_i + \theta_j^h$, (2)

and for the output-layer units

$net^o = \sum_{j} w_j^o O_j$. (3)

We assume that the output function in each unit is a sigmoid function. Then the output of the j-th unit
in the hidden layer is

$O_j = f_j(net_j^h) = \frac{1}{1 + e^{-net_j^h}}$. (4)

3  Estimation  Procedure 

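Let

$e_p = Z_p \beta_p + \epsilon_p$. (5)

Then, equation (1) can be rewritten as

$Y_p = f(X_p, W) + e_p$, $p = 1, \ldots, P$. (6)

It is easy to see that the variance matrix of the random vector $e_p$ is given by

$V_p = Var(e_p) = Z_p \Psi Z_p^T + \sigma^2 I_{r_p}$. (7)

Let

$A_p = V_p^{-1} = (A_{p,lk})$, $l, k = 1, \ldots, r_p$. (8)

Because of the heteroscedasticity of the random vector $e_p$, we use the weighted least squares method to estimate
the weights of the neural network.

The weighted objective function is

$E = \frac{1}{2} \sum_{p=1}^{P} (Y_p - f(X_p, W))^T A_p (Y_p - f(X_p, W))$ (9)

$\;\; = \frac{1}{2} \sum_{p=1}^{P} \sum_{l=1}^{r_p} \sum_{k=1}^{r_p} A_{p,lk} (y_{pl} - f(X_{pl}, W)) (y_{pk} - f(X_{pk}, W))$. (10)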
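
The weighted objective can be sketched numerically as follows. The sketch assumes the marginal covariance has the form $Z_p \Psi Z_p^T + \sigma^2 I$, with `Psi` standing for the random-effects covariance matrix, and that the weight matrix is its inverse; the function names and the toy residuals are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def marginal_cov(Z, Psi, sigma2):
    """Covariance of the combined error Z_p beta_p + noise for one
    individual: Z Psi Z^T + sigma2 * I (assumed form of the model)."""
    return Z @ Psi @ Z.T + sigma2 * np.eye(Z.shape[0])

def weighted_sse(residuals, Zs, Psi, sigma2):
    """Weighted sum of squared errors, 0.5 * sum_p e_p^T V_p^{-1} e_p,
    where e_p = Y_p - f(X_p, W) is supplied per individual."""
    total = 0.0
    for e, Z in zip(residuals, Zs):
        V = marginal_cov(Z, Psi, sigma2)
        # Solve V a = e instead of forming the inverse explicitly
        total += 0.5 * e @ np.linalg.solve(V, e)
    return total

# Toy example with unbalanced data: 2 and 1 observations per individual
residuals = [np.array([1.0, 2.0]), np.array([3.0])]
Zs = [np.ones((2, 1)), np.ones((1, 1))]

# Without random effects the objective reduces to ordinary least squares
ols_value = weighted_sse(residuals, Zs, Psi=np.zeros((1, 1)), sigma2=1.0)
# With a random intercept, correlated residuals are down-weighted
corr_value = weighted_sse([np.array([1.0, 1.0])], [np.ones((2, 1))],
                          Psi=np.array([[1.0]]), sigma2=1.0)
```

Setting `Psi` to zero recovers half the ordinary residual sum of squares, which is the special case used to initialize the iteration. Individuals may have different numbers of observations, since each covariance matrix is built from that individual's own design matrix.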

The  estimation  procedure  is  as  follows. 

Step 1:

The first stage of the procedure is to obtain the nonlinear ordinary least squares (OLS) estimator
by minimizing the residual sum of squares

$E(W) = \frac{1}{2} \sum_{p=1}^{P} (Y_p - f(X_p, W))^T I_{r_p} (Y_p - f(X_p, W))$. (11)

This stage provides an initial estimator and may be implemented by the ordinary Backpropagation
algorithm.

Step  2: 

After substituting the estimated residuals

$\hat e_p^{(k)} = Y_p - f(X_p, \hat W^{(k)})$, $p = 1, \ldots, P$, (12)

into equation (5), we obtain the following random coefficient linear equation:

$\hat e_p^{(k)} = Z_p \beta_p + \epsilon_p$, $p = 1, \ldots, P$. (13)

Thus,  the  least  square  estimation  and  are  given  by 


^p‘)  =  (Z?’Zp)-‘2Je(*) 


e(*)(/,p-Zp(2j^p)-‘Zp^)c(‘) 


respectively. 
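
The per-individual least squares step can be illustrated with a short sketch. It computes the estimate $(Z_p^T Z_p)^{-1} Z_p^T \hat e_p$ of the random coefficients from the residuals of one individual; the function name and the example design matrix are hypothetical, and the variance estimate is not reproduced here.

```python
import numpy as np

def estimate_random_effects(Z, e_hat):
    """Least squares estimate of the random coefficients of one
    individual, equal to (Z^T Z)^{-1} Z^T e_hat for full-rank Z."""
    beta_hat, *_ = np.linalg.lstsq(Z, e_hat, rcond=None)
    return beta_hat

# Hypothetical individual: intercept and slope over three time points
Z = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta_true = np.array([2.0, -1.0])
e_hat = Z @ beta_true          # noise-free residuals for illustration
beta_hat = estimate_random_effects(Z, e_hat)
```

Using `np.linalg.lstsq` avoids forming the inverse of $Z_p^T Z_p$ explicitly, which tends to be more stable when the columns of the design matrix are nearly collinear.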

Step  3: 

It  follows  from  equation  (13)  and  equation  (14)  that 


(15) 


$\hat\beta_p^{(k)} = \beta_p + (Z_p^T Z_p)^{-1} Z_p^T \epsilon_p$. (16)

Since  fip 

'x  X(0, 9),  the  least  square  estimator  9^*1  is  given  by 

p=i 

(17) 

where 

s = p3T  -  vm''  - 

p=i 

(18) 

and 

r  =  ?E3S‘>- 

p=i 

(19) 

Step 4.
Define

$A_p^{(k)} = [Z_p \hat\Psi^{(k)} Z_p^T + (\hat\sigma^{(k)})^2 I_{r_p}]^{-1}$. (20)

The weighted least squares estimates of the weights W of the neural network are obtained by minimizing

$E(W, A_p^{(k)}) = \frac{1}{2} \sum_{p=1}^{P} (Y_p - f(X_p, W))^T A_p^{(k)} (Y_p - f(X_p, W))$. (21)

The minimization of $E(W, A_p^{(k)})$ is implemented by modified Backpropagation; for details, see the Appendix.
If  <  e  ,  a  prespecified  error,  then  stop,  otherwise  go  to  step  1. 


4  An  Example 

The proposed neural network model is applied to fitting the data on the growth of orange trees over time
given in Draper and Smith (1981). The data are presented in Fig. 1 and consist of seven measurements of the
trunk circumference on each of five orange trees. It turns out that the neural network model fits the data
very well. The marginal correlation between observations on the same individual is quite high and is greater
at the larger time points. This reflects the large variability between trees as compared to within trees.

Appendix 

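Suppose that the global error surface is given by

$E(W, A) = \frac{1}{2} \sum_{p=1}^{P} \sum_{l=1}^{r_p} \sum_{k=1}^{r_p} A_{p,lk} (y_{pl} - f(X_{pl}, W)) (y_{pk} - f(X_{pk}, W))$. (22)

Let

$E_{p,lk} = \frac{1}{2} A_{p,lk} (y_{pl} - f(X_{pl}, W)) (y_{pk} - f(X_{pk}, W))$, (23)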

and 

«.i  (24) 

Then 

^  =  -lA.k(Oj(X('^)epk  +  0>(X(*))cp,,  (25) 

where  Oj(Xp^)  and  Oj(X^^^)  are  the  output  for  unit  j  in  the  hidden  layer  when  Xp^  and  are  input 
to  the  neural  network  respectively. 

Similarly,  we  can  find  the  gradient  of  the  error  surface  with  respect  to  the  hidden-layer  weights: 

=  -iA«(Oj(X('))X«Cp*  -H  0;(X(*))X(‘>)cp,,  (26) 

where 

0<(X('))  =  Oi(X(')(l  -  0^(X«),  (27) 

o;.(x<‘))  =  0,(X(‘)(1  -  0,(X(‘)),  (28) 

and

^  =  -iA„lOj(X(‘Hl  -  0,(X('))ep»  +  [0,(X(‘)(1  -  Oj(X(^^)e„.  (29) 

Hence 

fv;(t  +  1)  =  w;(t)  +  ft^A,t(Oj(X(0)e,t  -I-  0,(X(*))ep,.  (30) 
^*(1  +  1)  =  -  Oj (:((•> ^|:i>rrk  f  0^(A(*))(l  -  0;(A(*)))A<‘^r;.,].  (31)

fl^(<  +  1)  =  0^(0  +  »>5/li*[Oi(A(')(l  -  +  mAfHl  -  0,(A(*>)v.  (32) 
Figure 1. Trunk circumference (in millimeters) of five orange trees. Data and individual fitted
curves from RML estimation. Dashed line represents the mean curve.

Abstract

In the present paper, we attempt to show that the weight decay method, used
frequently to improve the generalization performance, is a method to decrease the
redundancy in networks. To confirm our hypothesis of the weight decay as a process
of the redundancy reduction, we performed two experiments: the inference of regular
English verbs and the inference of regular and irregular verbs with the grammatical
determination. In both cases, we could explicitly see that the redundancy was
decreased, proportional to the generalization errors, when the weight decay term was
added. Thus, we think that the weight decay is only a special case of more general
redundancy reduction methods for the improvement of the generalization performance.


1  Introduction 


Many techniques have been proposed to improve the generalization performance of neural
networks [1], [2], [5], [9]. The weight decay method is well known in the neural
network community and is widely used to improve the generalization performance. However, we have
little knowledge of the reason why the decay term can improve the generalization performance.
One possible answer was proposed by Krogh and Hertz [3]. They argued that the decay term
suppressed irrelevant parts of the weight vectors and the effects of static noises.

We think that the weight decay can decrease the redundancy in networks and thus tries to
maximize the possible information for new patterns. In other words, networks try to minimize
the information content specific to training input patterns. Redundancy is usually defined by the
difference between maximum entropy and observed entropy:

$R = \frac{H_{max} - H}{H_{max}}$, (1)

where $H_{max}$ is a maximum entropy and H is an observed entropy. Thus, using this redundancy,
our objective is to show that networks try to minimize this redundancy R by using the weight
decay term for the improvement of the generalization performance. Finally, one of the difficult
problems is to define an information or redundancy for networks. We have defined an entropy for
the internal representation. Thus, to reduce the redundancy corresponds to the reduction of the
redundancy defined for the internal representation.

2  Theory  and  Computational  Methods


Let us formulate an entropy function for the internal representation. Suppose that a network is
composed of three layers: input, hidden and output layers. Hidden unit activities are denoted by $v_i$,
and input terminals by $\xi_j$. Then, input-hidden connections are denoted by $w_{ij}$ and hidden-output
connections are denoted by $W_{ij}$.

A hidden unit produces an output

$v_i^k = f(u_i^k)$,

where f is a logistic function defined by

$f(u_i^k) = \frac{1}{1 + e^{-u_i^k}}$

and where $u_i^k$ is a net input to the ith hidden unit, defined by

$u_i^k = \sum_{j=1}^{L} w_{ij} \xi_j^k$,

where $\xi_j^k$ is the jth element of the kth input pattern and L is the number of elements in the pattern. An
entropy for the kth input pattern is defined by

$H_k = -\sum_i p_i^k \log p_i^k$, (2)

where

$p_i^k = \frac{v_i^k}{\sum_m v_m^k}$,

where the summation is over all the hidden units. Averaging over all the input patterns, we have

$H = -\frac{1}{N} \sum_k \sum_i p_i^k \log p_i^k$, (3)

where M is the number of hidden units and N is the number of input patterns. The redundancy
is defined by the difference between the maximum entropy and the observed entropy. That is,

$R = \frac{H_{max} - H}{H_{max}} = \frac{N \log M + \sum_k \sum_i p_i^k \log p_i^k}{N \log M}$. (4)
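
The redundancy measure can be computed directly from the hidden-unit activations. The sketch below assumes logistic hidden units without bias terms and normalizes each pattern's activations to obtain the probabilities; the function names and the test inputs are illustrative assumptions.

```python
import numpy as np

def hidden_activations(X, W):
    """Logistic hidden-unit outputs for N input patterns.
    X has shape (N, L); W has shape (M, L) (input-hidden weights)."""
    return 1.0 / (1.0 + np.exp(-(X @ W.T)))

def redundancy(V):
    """Redundancy R = (H_max - H) / H_max of an (N, M) activation
    matrix, with p_i^k = v_i^k / sum_m v_m^k for each pattern k."""
    V = np.asarray(V, dtype=float)
    N, M = V.shape
    P = V / V.sum(axis=1, keepdims=True)
    # Entropy summed over patterns; guard log(0) for exact zeros
    H_sum = -np.sum(P * np.log(np.where(P > 0.0, P, 1.0)))
    return (N * np.log(M) - H_sum) / (N * np.log(M))

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 4)).astype(float)   # 5 binary patterns

# All-zero weights: every hidden unit outputs 0.5, so p is uniform
R_zero = redundancy(hidden_activations(X, np.zeros((3, 4))))
# One dominant hidden unit per pattern: entropy near zero
R_peaked = redundancy(np.array([[1.0, 1e-12, 1e-12]]))
```

The two extreme cases bracket the measure: uniform activations give a redundancy of zero, while a representation dominated by a single hidden unit gives a redundancy close to one.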

2.1  Weight  Decay  as  a  Process  of  Information  Minimization 

Weight decay is a very popular method used to improve the generalization performance.
For the weight decay, the sum of the squared weights:

$C = \frac{\lambda}{2} \sum_{ij} w_{ij}^2$

must be minimized. Differentiating both sides of this equation, we have

$\frac{\partial C}{\partial w_{ij}} = \lambda w_{ij}$.

Now,  let  us  see  how  the  information  is  changed  as  the  weight  is  close  to  zero.  It  is  easily  verified 
that 

ATlogM  '  ■' 


Thus, as the weights approach zero, the redundancy also approaches zero. In other words, the
weight decay method can be considered one of the methods for minimizing the redundancy.
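
A single weight update with decay can be sketched as follows, assuming the penalty is $\frac{\lambda}{2} \sum w_{ij}^2$ so that its gradient is $\lambda w_{ij}$. The function name `decay_step`, the learning rate, and the decay coefficient are hypothetical values chosen for illustration.

```python
import numpy as np

def decay_step(w, grad_E, lr, lam):
    """One gradient step on E + (lam / 2) * sum(w^2): the decay term
    contributes lam * w to the gradient (assumed penalty form)."""
    return w - lr * (grad_E + lam * w)

w0 = np.array([1.0, -2.0])
# With a zero error gradient, each step shrinks w by (1 - lr * lam)
w1 = decay_step(w0, np.zeros(2), lr=0.1, lam=0.5)

# Repeated pure-decay steps drive the weights toward zero
w = w0.copy()
for _ in range(200):
    w = decay_step(w, np.zeros(2), lr=0.1, lam=0.5)
```

When the error gradient is small, the decay term dominates and the weights contract geometrically toward zero, which is the regime in which the hidden activations become uniform.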


3  Results 

3.1  Data  and  Network  Architectures 

In the experiments, we trained networks to produce correct past tense forms, given various verb stems of artificial languages close to English. All the artificial languages were composed of strings of the forms CVC, CCV, and VCC, where V is a vowel and C is a consonant. Each string was represented in a phonological representation with eight bits, as used by Plunkett et al. [6]. The number of training patterns was 100 for the regular verbs and 200 for the irregular verbs. The numbers of validation patterns and of testing patterns were both 500 for all the experiments. The number of input, hidden, and output units was 18, 30, and 20, respectively, for the inference of regular verbs and 18, 30, and 21, respectively, for the inference of regular and irregular verbs with grammatical determination. Networks started learning from random initial weights in the range (-0.25, 0.25). The momentum parameter was fixed at 0.9 for all the experiments. Learning was performed in batch mode, meaning that weights were updated after processing all the input patterns. Learning was considered finished only when the number of epochs reached 200. If over-training was observed, learning was stopped immediately before the over-training (BP-stop-learning).

3.2  Redundancy  and  Generalization  Errors 

Figure 1 shows the generalization errors (upper figure) and the redundancy (lower figure), computed with standard BP and with weight decay for the inference of regular verbs. In this case, a trial was considered a success only if the Hamming distance between outputs and targets was zero. Generalization errors were normalized to the range [0,1]. As the figure shows, the generalization errors with standard BP remain constant, meaning that standard BP could not generalize at all, although the errors for the training patterns were exactly zero. On the other hand, the generalization errors with weight decay decrease gradually. The lower figure shows that, except in the first stages of training, the redundancy decreases in direct proportion to the generalization errors when weight decay is used.

Then, we used irregular verbs in addition to regular verbs and attempted to determine the well-formedness of the obtained strings. Figure 2 shows the generalization errors (upper figure) and the redundancy (lower figure), computed with standard BP and with weight decay. As the figure shows, with standard BP the generalization errors remain constant. On the other hand, with the weight decay method, the generalization errors decrease significantly. The lower figure shows the redundancy, computed with standard BP and with weight decay. The redundancy computed with standard BP decreases very slowly, whereas the redundancy computed with weight decay decreases quickly. Thus, these results show that the generalization errors are in direct proportion to the redundancy and that weight decay is a method for decreasing the redundancy.


4  Conclusion 

In the present paper, we have attempted to show that the redundancy is decreased by the weight decay method and that the weight decay method is simply a method for minimizing the redundancy in neural networks. Weight decay is a popular method for improving generalization. However, we have observed that there are some cases in which the decay method is not so effective. In these cases, we think that direct redundancy reduction with the weight decay method is effective in improving the generalization performance.

Figure 1: Generalization errors (upper figure) and redundancy (lower figure), computed for the inference of regular verbs. The learning rate and the parameter $\lambda$ for the weight decay were set to 0.1 and 0.005, respectively.

Figure 2: Generalization errors (upper figure) and redundancy (lower figure), computed for the inference of regular and irregular verbs with grammatical determination. The learning rate and the parameter $\lambda$ for the weight decay were set to 0.05 and 0.005, respectively.

Abstract: Many approaches have been devised to vary the number of hidden units in a backpropagation-based neural network in a systematic way to obtain a more efficient network. This paper presents an approach that adds hidden units and then selectively removes them to determine a more efficient network for a particular problem. The addition phase adds hidden units one at a time until a convergence criterion is satisfied. The selective pruning process selects the least important hidden unit to be removed based on the weights associated with the hidden units. This is an improvement over a previous method (Hirose et al. 1991) in which the order in which hidden units are chosen for removal is fixed. Simulations using Boolean test cases were carried out on a backpropagation network with one hidden layer, and the results show improvements over the previous method.

1.  Introduction 

The backpropagation (BP) algorithm (Rumelhart, Hinton and Williams 1986; Werbos 1974) opened a new way for training multi-layer networks. It provides a solution to some problems encountered by single layer perceptrons (Minsky and Papert 1988) and works well in many applications.

However, the original BP algorithm does not address the issue of the construction of the optimal network architecture necessary for the learning of the intended input-output mapping for a particular problem. In particular, the near-optimal number of hidden units in the network is usually obtained through trial-and-error experimentation, which is ad hoc and often time-consuming. Some solutions to this problem have been proposed that involve the use of mechanisms that permit the network to grow and shrink, i.e., creating or removing hidden unit(s) or hidden layer(s) during the process of training (e.g., Ash 1989; Hirose et al. 1991).

An optimal network facilitates the construction of an internal representation that is appropriate for learning the desired mapping. This avoids both the problem of a larger than necessary network that does not generalize well (Denker et al. 1987) and the problem of a network that is too small and that does not possess enough power to learn the mapping correctly.

Very often, an optimal network may be difficult to obtain, but a near-optimal network or a more efficient network with a smaller number of hidden units can be obtained through one of the methods that modify the number of hidden units in a systematic manner until such a network is found. These methods (e.g., Ash 1989; Hirose et al. 1991; Sietsma and Dow 1988) employ three ways to modify the number of hidden units:

i) start with fewer units and add some more (pure adding),

ii) start with too many units and take some away (pure pruning), or

iii) start with fewer units and, after adding, take the redundant units away.

A pure adding approach, such as the Dynamic Node Creation method (Ash 1989), adds hidden units one at a time to the network. A certain desired accuracy of the network is specified and addition of a node is carried out when the accuracy is not reached and further training does not bring about any further improvement of the accuracy. This is repeated until the desired accuracy is achieved. Conversely, a pure pruning technique starts with a network that is larger than appropriate and removes redundant nodes (e.g., Sietsma and Dow 1988).

Methods such as that of Hirose, Yamashita and Hijiya (Hirose et al. 1991, henceforth referred to as the Hirose algorithm) combine addition and removal of hidden units. First, addition is carried out until the network has learnt a set of weights for the correct mapping. Any redundant units can then be pruned off to achieve a more efficient size. The most recently added hidden unit is chosen for removal first.

In a method that uses addition of hidden nodes followed by pruning of the network, an effective pruning technique is essential to ensure that no added hidden units are removed wrongly. This prevents extra effort that would be needed to retrain the network if important hidden units were wrongly pruned. This paper describes such an effective pruning approach in the context of an addition cum removal method that represents an improvement over the pruning method used in the Hirose algorithm.

We use the same method as that of the Hirose algorithm (described in Section 2) for the addition of hidden units. However, in the pruning phase, instead of using a fixed removal order for the hidden units (i.e., removing the most recently added hidden unit first) as in the Hirose algorithm, which could result in some problems as described in Section 3, our method selectively removes the hidden units based on their importance. This method is described in Section 4. Simulations with Boolean test cases were carried out in a network with one hidden layer and the results are presented in Sections 5 and 6.

2. The Hirose Algorithm

In the addition phase, the algorithm starts with one hidden unit. Weight corrections are the same as those in the BP algorithm. E, the total error between the target and the actual output, is expressed as

$$E = \sum_p \sum_j (t_{pj} - o_{pj})^2,$$

where $t_{pj}$ is the desired output of an output unit j for pattern p, $o_{pj}$ is the actual output of output unit j for pattern p, and the sum is taken over the set of output units for the set of training patterns.

E is checked after every 100 weight corrections. If it decreases by less than 1% of its previous value (i.e., the error 100 weight corrections ago), a new hidden unit is added. However, if E decreases by more than 1%, no hidden unit is added and the weights are corrected another 100 times. The criterion for convergence is when E is less than 0.01.
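The decision made at each check can be sketched as follows. This is a minimal illustration of the rule as stated above, not the authors' implementation; the function name, the constant names and the returned labels are hypothetical. The text does not say what happens when E decreases by exactly 1%, so this sketch treats that case as "continue".

```python
CHECK_INTERVAL = 100           # weight corrections between two checks of E
MIN_RELATIVE_DECREASE = 0.01   # E must fall by at least 1% of its previous value
CONVERGENCE_THRESHOLD = 0.01   # the network has converged when E < 0.01


def addition_phase_action(prev_error, curr_error):
    """Decide what to do after CHECK_INTERVAL weight corrections.

    prev_error is E at the previous check, curr_error is E now.
    Returns "converged", "add_unit" or "continue" (hypothetical labels).
    """
    if curr_error < CONVERGENCE_THRESHOLD:
        return "converged"
    if prev_error - curr_error < MIN_RELATIVE_DECREASE * prev_error:
        return "add_unit"
    return "continue"
```

For example, a drop from 1.0 to 0.995 is a decrease of only 0.5%, so a hidden unit would be added, whereas a drop from 1.0 to 0.9 lets training continue with the current architecture.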


When a new hidden unit is added, the initial weights between the added hidden unit and the other units in the input and output layers are assigned random values.

The algorithm enters the reduction phase as soon as the network converges. The most recently added hidden unit is removed. Training continues with the pruned network. If the network converges again, another hidden unit is removed. This procedure is repeated until the network no longer converges, whereby the algorithm judges the present network to be incapable of learning the mapping. Hence the required number of hidden units is one plus the present number of hidden units.

3.  Effects  of  Order  of  Removal  in  Hirose  Algorithm 

In the Hirose algorithm, the most recently added hidden unit is removed first. A question arises as to how the network's performance will be affected if another hidden unit is removed first instead. To answer that, we investigated the effect of changing the order of removal of hidden units using the PAR2 problem (parity problem with an input pattern of size 2, or XOR). The simulation resulted in 4 hidden units at the end of the addition phase. The weights of the network before the reduction phase are shown in Figure 1.

Figure 1. Weights between the hidden and output layers of a PAR2 simulation before the reduction phase.

In one trial, the most recently added hidden unit was removed first, and in another, the initial hidden unit that the network started with was removed first. The behaviors of the network are graphed in Figure 2 and Figure 3, respectively.


Both removals resulted in 2 hidden units. However, when the initial hidden unit was removed first (Figure 3), the network took more than twice as long to converge before the next hidden unit could be removed. Also, the error after the removal of the first hidden unit rose to a much higher level compared to that in the case where the most recently added hidden unit was the first to be removed.

Figure 2. Graph of total error E versus the number of iterations for PAR2 (Parity 2) when the 4th (most recently added) hidden unit was removed first. Hidden units were added at iterations 300, 400, and 500. Pruning took place at iterations 711, 723, and 800. The final architecture thus required 2 hidden units.

Figure 3. Graph of total error E versus the total number of iterations for PAR2 when the initial hidden unit was removed first. Removal of subsequent hidden units was slower and the errors immediately after the removal of the hidden units were higher.

Comparing the absolute values of the weights (Figure 1), W1 is the largest weight and W4 is the smallest compared to the rest of the weights. Thus, the first hidden unit is likely to be the most important hidden unit. In this example, the most recently added unit happened to be the least important unit. In general, this may not be the case. Hence, using a fixed removal order gives rise to the possibility of removing an important unit which contributes significantly to the input/output mapping. This results in the longer time needed for convergence. There is also the possibility that the removal of the wrong node would result in a less optimal network, i.e., one with more nodes than necessary.

4. Selective Pruning

We introduce a modified method called selective pruning that improves on the fixed removal order method by taking into consideration the magnitudes of the weights between the hidden and output layers before pruning.

The idea of selective pruning is to remove the hidden unit that contributes the smallest total weight to the output units. That is, the hidden unit corresponding to

$$\min_{i = 1, \dots, H} \Big( \sum_j |w_{ji}| \Big) \qquad (1)$$

is pruned, where $w_{ji}$ is the weight associated with the connection between a hidden unit i and the output unit j, and H is the total number of hidden units.


The absolute value of each weight is considered because it is undesirable to remove a hidden unit with a large negative weight (as well as one with a large positive weight), since this weight contributes significantly to the inhibitory signal sent to the output node.

The  selective  pruning  technique  involves  three  steps: 

1) Remove the hidden unit with the smallest sum of absolute weights associated with the connections to the output units. This sum is computed according to Eq. 1.

2)  Train  the  network  with  the  remaining  hidden  units. 

3)  Repeat  1)  and  2)  until  the  network  can  no  longer  converge. 
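
Step 1 of the procedure above can be sketched in Python as follows. The function names and the nested-list weight layout (`hidden_to_output[j][i]` for $w_{ji}$, one row of `input_to_hidden` per hidden unit) are assumptions made for this illustration. Retraining (step 2) and the convergence test (step 3) are left to whatever BP routine is in use.

```python
def output_weight_totals(hidden_to_output):
    """Return sum_j |w_ji| for every hidden unit i.

    hidden_to_output[j][i] is the weight w_ji from hidden unit i to output unit j.
    """
    n_hidden = len(hidden_to_output[0])
    return [sum(abs(row[i]) for row in hidden_to_output) for i in range(n_hidden)]


def select_unit_to_prune(hidden_to_output):
    """Index of the hidden unit with the smallest total absolute output weight."""
    totals = output_weight_totals(hidden_to_output)
    return min(range(len(totals)), key=totals.__getitem__)


def prune_unit(input_to_hidden, hidden_to_output, unit):
    """Return copies of both weight matrices with the given hidden unit removed.

    input_to_hidden[i] is the list of incoming weights of hidden unit i.
    """
    new_input_to_hidden = [row for i, row in enumerate(input_to_hidden) if i != unit]
    new_hidden_to_output = [[w for i, w in enumerate(row) if i != unit]
                            for row in hidden_to_output]
    return new_input_to_hidden, new_hidden_to_output
```

With a single output unit and hidden-output weights 3.0, -2.5, 0.4 and -0.1, the fourth unit (index 3) is selected. The large negative weight -2.5 protects the second unit, as intended by the use of absolute values.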

5. Simulation Results Obtained for PAR4

Figures 4 and 5 illustrate the results obtained with PAR4 (parity problem with input pattern size 4) using the Hirose fixed pruning method and the selective pruning method, respectively. Both simulations were based on the same set of initial weights.

Figure 4. Fixed pruning applied to PAR4. 8 hidden units were added and 3 hidden units were removed. The graph on the right shows an enlarged portion of the pruning process.

Figure 5. Selective pruning applied to PAR4, using the same initial weights as in Figure 4. 4 units were pruned. The graph on the right shows an enlarged portion of the pruning process.

In both simulations, 8 hidden units were added. Selective pruning removed 4 hidden units, in contrast to Hirose fixed pruning, which removed 3 hidden units. The enlarged portions of both pruning processes show that within the same 100 iterations, selective pruning removed more hidden units than Hirose fixed pruning. Hence, the removal of subsequent hidden units is faster in selective pruning.

Also, the figures show that with selective pruning, the error immediately after pruning a hidden unit rose to a level lower than that in fixed pruning. This is because the unit removed in selective pruning contributed less to the mapping than in the case of fixed pruning. Thus these two illustrations show that selective pruning removes more hidden units faster without driving the error up, due to the fact that the correct redundant unit was removed.

6. Selective Pruning Applied to Other Boolean Test Cases

Simulations were carried out to test how well selective pruning works for the other Boolean test cases. Boolean test cases were culled from Rumelhart and McClelland (1986, chapter 8). These include the parity, symmetry, and encoder problems, and binary addition with carry. A total of 50 trials were carried out for the same set of initial weights in the range (-1, 1) using learning rate 0.7 and momentum 0.7. The average number of hidden units arrived at is tabulated in Table 1.

The table shows that the selective pruning method performed better than the fixed pruning procedure in being able to arrive at a smaller number of hidden units on the average. The largest improvements are in the cases of PAR6 and ADD3. Also, for selective pruning, removal of subsequent hidden units took place very much faster after the first hidden unit was removed. In many cases, this could be as fast as requiring only 1 or 2 iterations.

7. Conclusions

In the combined addition and selective pruning method, the redundant hidden units added during the training process are removed. The selective pruning method considers the magnitudes of the trained weights before selecting the hidden unit to be removed. The method removes the hidden unit with the smallest total of the absolute weights associated with the connections to the output units. This has been shown to 1) remove more hidden units, resulting in a more efficient network; 2) increase the likelihood that redundant hidden units are removed, and consequently 3) reduce the effort needed to retrain the pruned network, which results in faster removal of subsequent hidden units.

The above method overcomes the problems faced by a method that uses a fixed removal order of hidden units. For more complex architectures consisting of more hidden units and more weight corrections, removing the "right" hidden units is especially important. Rather than removing the hidden units in a fixed manner, it would be more efficient if the algorithm were capable of removing redundant hidden units in a non-arbitrary and effective manner.

Problem    Fixed Pruning    Selective Pruning
PAR2            2.08              2.02
PAR4            4.94              4.34
PAR5            6.46              5.31
PAR6           12.60              8.20
SYM4            3.04              2.94
SYM6            2.33              2.02
ADD2            4.03              4.00
ADD3            9.15              8.77
ENC16           4.00              4.00

Table 1. Average number of hidden units for the Hirose fixed pruning method (in which the most recently added hidden unit is removed) and the selective pruning method (in which the hidden unit with the smallest total of the absolute weights associated with all its connections to the output units is removed). Both simulations were based on the same set of initial weights in the range (-1, 1).

Abstract: A constructive training algorithm for supervised neural networks, based on a linear-programming procedure, is presented. It builds two-layer single-output networks that implement any consistent training set of binary or real-valued examples. The algorithm can incorporate a self-pruning technique; in fact, it can determine the percentage variation of the examples which are satisfied by the construction of any further hidden neuron. Simulations show satisfactory results.

1  Introduction 

In the present paper we propose a constructive training algorithm for supervised single hidden layer neural networks. The algorithm is guaranteed to implement any consistent training set of binary or real-valued examples classified into two classes. It extends the constructive algorithm based on a linear-programming approach presented in ref. [1] and is similar to the 'sequential' training algorithm of Marchand et al. [2], without the restriction to binary inputs only. Our approach is based on the following remarks:

1) For classification problems on a point set, it is known that a single hidden layer is sufficient to implement any task [3]. In fact, the hidden neurons define the hyperplanes which bound a decision region that is an approximate version of the true decision region for the problem;

2) Hyperplanes coincident with or very close to the previous ones are determined step by step by means of a procedure inspired by the 'simplex method';

3) Our constructive algorithm is inherently self-pruning, since it is able to measure the importance of each hyperplane to the total solution of the given problem. Therefore, we have all the information necessary for applying a simple and effective pruning of the neurons and for obtaining a simultaneous control of the resulting generalization capability of the net.

2  The  Proposed  Algorithm 

The  proposed  algorithm  is  based  on  the  constructive  procedure  described  in  ref.  [1].  That 
procedure  builds  a  cascade  scheme  which  satisfies  all  the  examples  of  a  training  set  by  a  suitable 
number  of  neurons.  In  order  to  understand  the  method,  it  is  necessary  to  summarize  the  procedure 
followed  in  determining  the  cascade.  Let  us  denote  by  type  1  and  0  respectively  the  examples  of  the 
training  set  belonging  to  the  two  possible  classes.  In  correspondence  to  step  k,  the  k-th  neuron  of 
the  cascade  is  determined  by  carrying  out  the  three  following  substeps: 

i) determination of the residual training set. The number of examples decreases at each step (i.e., we have convergence to zero errors);

ii) determination of the hyperplane in the input space which correctly classifies all the examples of one type and the maximum number of examples of the other type. This substep is carried out by a linear programming procedure based on the simplex method. Namely, at each application of the procedure a 'feasible set' of the linear inequalities that correspond to the suitably modified training examples is found;

iii) determination of the connection weights and the threshold of the k-th neuron from the previous hyperplane.

The cascade obtained at the end of the k-th step satisfies a certain percentage of examples of the original training set, say $P_k$. It is important to note that $P_k \geq P_{k-1}$.

The proposed method builds a single hidden layer network by means of two applications of the previously described procedure. It starts by applying this procedure to the given training set. The neurons of the resulting cascade, with the same connection weights and thresholds, constitute the first layer of the neural network we are constructing. Some of these neurons can be eliminated by the pruning technique discussed in sect. 3. Then, we determine the outputs of the neurons of the first layer corresponding to all the inputs of the original training set. These outputs take on the values {1,0} since the activation function of the neurons of the cascade is hard. The set of input-output pairs identified by the calculated outputs of the first layer neurons (as inputs for the second layer) and by the original desired outputs constitutes a new binary training set. Since it is always possible to solve a classification problem on a point set by only one hidden layer, the new training set is characterized by linearly separable binary examples. Hence, a second application of the linear-programming procedure on this linearly separable training set constructs a single neuron that constitutes the second layer of the network.
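
The construction of the new binary training set can be sketched as follows. This is only an illustration under stated assumptions: the first-layer hyperplanes are taken as given (here they are written by hand rather than found by linear programming), the hard activation is assumed to output 1 when the weighted sum exceeds the threshold, and all function names are hypothetical.

```python
def hard_neuron(weights, threshold, x):
    """Hard-limiting neuron: 1 if w . x exceeds the threshold, else 0 (assumed convention)."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > threshold else 0


def second_layer_training_set(first_layer, examples):
    """Map every training input through the first layer.

    first_layer is a list of (weights, threshold) pairs, one per hidden neuron;
    examples is a list of (input, target) pairs with targets in {0, 1}.
    Returns the new binary training set for the single second-layer neuron.
    """
    return [(tuple(hard_neuron(w, t, x) for w, t in first_layer), target)
            for x, target in examples]


# Hand-written hyperplanes for XOR, used only to exercise the function.
xor_first_layer = [((1.0, 1.0), 0.5), ((1.0, 1.0), 1.5)]
xor_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
xor_second_layer_set = second_layer_training_set(xor_first_layer, xor_examples)
```

In this XOR example the hidden outputs (0,0), (1,0) and (1,1) carry the targets 0, 1 and 0, which a single output neuron can separate.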

3  Pruning  and  Generalization  Capability 

The generalization capability of the two-layer network concerns its operation on examples outside the training set. It can be strongly hampered by wrong examples present in the training set, since the training algorithm will try to accommodate them in contrast with the remaining ones. Consequently, a robust training algorithm should be able to detect wrong examples and to neglect them.

The algorithm proposed in the present paper can be easily tailored to incorporate this capability, since it is possible to measure the importance of each neuron to the formation of the decision region. In fact, when we apply the cascade procedure for determining the hidden layer, we can simultaneously determine the percentage variation of the training examples which are satisfied by the addition of any further neuron. It is evident that, when the percentage is very close to 100%, the remaining examples to be satisfied are very different from those already satisfied. This difference is usually due just to the presence of wrong examples. A reasonable strategy in this case is consequently to prune away the final neurons of the cascade, retaining only those which guarantee a desired threshold on the percentage of satisfied examples.

As an illustration of this point, we consider the simple case of classifying two regions in a plane separated by a straight line. The training set consists of examples for 18 points uniformly located in the first region and another 18 similarly located in the other region. One of those examples is then given the wrong label. In Fig. 1 these points are represented by black or white dots, together with the lines corresponding to the neurons which the cascade procedure successively determines. The lines are labelled 'a', 'b', and 'c' in order of succession. We see from the figure that line 'a' classifies all the points correctly apart from the wrongly labelled one. The first neuron of the cascade attains a percentage of correct classification equal to 97.2% (35 out of 36 examples of the training set). This percentage does not vary when a further neuron is added. In fact, the line corresponding to this neuron, i.e. 'b', is not sufficient to separate the wrong point. Only when we add the last neuron, which introduces line 'c', does the percentage rise to 100%. However, the overfitting performed by the cascade algorithm when we try to arrive at a zero-error solution is evident. Equally clear is the simple and effective pruning technique which we can implement by monitoring the percentage of training examples satisfied by the cascade scheme during its construction: it is sufficient to stop the construction, pruning away the successive neurons, when the percentage of satisfied examples attains a suitable threshold.
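
The stopping rule can be sketched as follows, assuming that the percentage $P_k$ of satisfied examples is recorded after each neuron of the cascade is built. The function name and the list representation are hypothetical, and the fallback of keeping the whole cascade when the threshold is never reached is an assumption of this sketch.

```python
def neurons_to_keep(satisfied_percentages, threshold):
    """Number of cascade neurons to retain.

    satisfied_percentages[k-1] is P_k, the percentage of training examples
    satisfied after the k-th neuron; the list is assumed non-decreasing.
    The cascade is cut at the first neuron whose P_k reaches the threshold;
    if no neuron reaches it, the whole cascade is kept.
    """
    for k, p in enumerate(satisfied_percentages, start=1):
        if p >= threshold:
            return k
    return len(satisfied_percentages)


# Percentages from the 36-point example: lines 'a', 'b' and 'c'.
example_percentages = [100.0 * 35 / 36, 100.0 * 35 / 36, 100.0]
```

With a threshold of 97%, only the neuron for line 'a' is kept and the mislabelled point is ignored; insisting on 100% retains all three neurons.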


Fig.  1:  Example  related  to  the  adopted  pruning  technique  (classification  on  a  set  of  36  points). 

4  Simulations 

In order to illustrate the proposed algorithm, we describe in the following simulations with both binary and real-valued examples.

Random Boolean functions: To test the robustness of the algorithm, we have generated at random 100 Boolean functions on 6 input bits. As expected, a single hidden layer network was always found. The average number of hidden neurons found was 7.37±1.06, which is very close to the one obtained by Marchand et al. [2] (7.28±0.82) and significantly better than the results reported in ref. [4] (20.5±3.9) and [5] (about 18 units in 4 layers).

Parity functions: We remark that, in the case of Parity functions (tested from N=2 to N=8), the algorithm constructed networks with a number of hidden neurons equal to $\lceil N/2 + 1 \rceil$. The only algorithm with a similar performance that we are aware of is Cascade-Correlation [6].


Fig. 2: The twin spirals problem training set (194 pixels distributed in two interlocking spirals).

Twin spirals: The twin spirals problem (separating 194 points from two interlocking spirals, see Fig. 2) requires a highly nonlinear classification of real-valued patterns; it is therefore an extremely hard problem for algorithms of the Back-Propagation family to solve [7]. The proposed algorithm found a solution with 20 hidden neurons. We consider the resulting decision region (see Fig. 3) to be satisfactory. The time required for building the network (with a 486-based computer) was less than five minutes. We note that a solution of the same problem with Cascade-Correlation [6] and Upstart [8] required about one hour of computation time under the same conditions. Finally, we remark that the only other solution to twin spirals in a single hidden layer architecture that we are aware of has 50 hidden units [9].

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Fig.  3:  The  decision  region  obtained  with  a  20  hidden  neurons  network. 

5  Conclusions 

By relying on the proposed constructive algorithm, some of the inconveniences impairing the determination of a supervised neural network can be circumvented. In particular, the architecture and the connection weights are determined directly from the training set without trials and time-consuming iterative procedures. Also the generalization capability can be controlled by a very simple pruning technique.

In the present case we have considered the cascade procedure which is based on a modified simplex method. Consequently, a possible improvement can be obtained by using the neural networks proposed in the technical literature for solving linear programming [10-12].

Abstract

It  is  well  known  that  the  generalization  ability  of  a  neural  network  can  be  improved  by 
reducing  the  network’s  complexity.  Pruning  methods  based  on  the  idea  of  eliminating  weights 
of  a  well  working  feed-forward  network  have  proved  to  be  successful.  The  theoretical  background 
of  pruning  is  established  through  a  mathematical  model  of  generalization.  A  short  description 
of  three  of  the  most  popular  pruning  methods,  Optimal  Brain  Surgeon  (OBS)  [Hassibi  et  al. 
93], Optimal Brain Damage (OBD) [LeCun et al. 90] and Magnitude Based Pruning (MAG) is
given.  These  three  methods  have  experimentally  been  tested  on  standard  benchmark  problems 
known  as  the  MONK’s  problems  [Thrun  et  al.  91].  It  is  shown  that  there  is  no  theoretical 
evidence  for  choosing  one  of  the  methods  compared  to  the  other.  This  was  confirmed  by  the 
experiments  showing  that  all  methods  were  capable  of  reducing  the  number  of  weights  of  a  well 
working  network,  but  none  of  the  methods  was  the  best  every  time.  However,  OBS  was  the 
most  robust,  stable  and  fastest  method  although  it  could  be  caught  in  a  local  minimum.  Both 
MAG  and  OBD  showed  a  fluctuating  performance  according  to  number  of  weights  they  could 
remove  when  the  initialized  conditions  or  learning  parameters  were  changed. 


1  Introduction 

The main goal with a neural network classifier is to get a system that is able to classify unknown data correctly, i.e. a system with a good generalization ability. Vapnik (1992) and many others have shown the relation between the capacity of a network and its generalization ability. It is also known that the capacity of a feed-forward network is related to the number of units and weights in the network, but it is often impossible to determine the exact capacity of a neural network. In order to overcome this problem, various more or less mathematically based methods have been proposed. One group of methods focuses on successively building a network, while others try to reduce the number of units or weights in a well working network. This paper focuses on the latter group of methods, known as pruning. Through a description of generalization the foundation of pruning is established. Three of the most popular methods are described and tested on a benchmark problem.

2  Learning  and  Generalization 

This section describes generalization and how it is related to a feed-forward network. The learning paradigm considered here is supervised learning. The error between the output $\mathbf{y} = f_W(\mathbf{x})$ and the desired output $\mathbf{t}$ for the same input is normally calculated by an error function $E(\mathbf{y}, \mathbf{t})$, often, but not necessarily, the Least Mean Square error between $\mathbf{y}$ and $\mathbf{t}$. Henceforth bold script will denote a vector or a matrix. The learning error on a set of known data, the training set $\{\mathbf{x}_i, \mathbf{t}_i\}$, $1 \le i \le m$, is defined as

$$E_L(W) = \frac{1}{m} \sum_{i=1}^{m} E(\mathbf{y}_i, \mathbf{t}_i) \qquad (1)$$

and measures the dissimilarity between the network output and the desired output within the restricted domain of input patterns in the training set. Note that it is a function of the weights in the weight space W.

However,  the  error  of  interest  is  the  expected  error  on  unknown  data,  the  test  set.  This  error  is 
called  the  generalization  error  and  is  defined  as  an  average  over  the  full  distribution  of  input-output 
pairs,  and  can  be  expressed  as: 

$$E_G(W) = \int E(\mathbf{y}, \mathbf{t}) \, P(\mathbf{x}, \mathbf{t}) \, d\mathbf{x} \, d\mathbf{t} \qquad (2)$$

where  P(x,  t)  is  the  joint  probability  distribution  formed  by  P(x)  describing  the  distribution  of 
data  in  the  input  space  and  P(t|x)  describing  the  functional  dependence,  so  P(x,t)  =  P(t|x)P(x). 
The  generalization  error  is  the  expectation  value  of  error  for  an  arbitrary  (x,t)  point  drawn  from 
the  P(x,  t)  distribution. 

The  real  goal  of  learning  is  to  minimize  the  generalization  error.  But  the  joint  probability 
distribution  P(x,t)  is  unknown  and  the  only  available  information  is  contained  in  the  training  set. 
In  order  to  solve  this  problem  the  generalization  error  is  replaced  by  the  learning  error,  computed 
empirically  on  the  basis  of  data  available  in  the  form  of  a  training  set. 

For simplicity the following description is restricted to the binary case, where data from an N-dimensional space are to be classified as belonging to one of two possible classes. This means that $y \in \{0, 1\}$ and that the mapping function $f_W$ is an indicator function. The error function $E(t, y)$ is also assumed to be an indicator function. This means that $E(t, y) = 0$ if $f_W(\mathbf{x}) = t$, and $E(t, y) = 1$ otherwise, so $E_G(W)$ is the probability of error, and $E_L(W)$ is the frequency of error on the training set.

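Inspired by the Bernoulli theorem, Vapnik (1982; 1992) found that with a probability larger than $(1 - \eta)$, simultaneously for all possible configurations $\{W\}$, the following relation between $E_G(W)$ and $E_L(W)$, independent of $P(\mathbf{x}, \mathbf{t})$, is valid:

$$E_G(W) < E_L(W) + C\left(\tfrac{m}{h}, E_L, \eta\right) \qquad (3)$$

where $C(\tfrac{m}{h}, E_L, \eta)$ is a confidence interval, a function which depends on $m$, the number of training patterns, $h$, the capacity of the network, $E_L$, the learning error, and $\eta$, the accuracy parameter corresponding to the probability. The confidence interval is defined as:

$$C\left(\tfrac{m}{h}, E_L, \eta\right) = 2 C_0^2\left(\tfrac{m}{h}, \eta\right) \left( 1 + \sqrt{1 + \frac{E_L}{C_0^2\left(\tfrac{m}{h}, \eta\right)}} \right) \qquad (4)$$

where

$$C_0\left(\tfrac{m}{h}, \eta\right) = \sqrt{\frac{\left(\ln \frac{2m}{h} + 1\right) h - \ln \eta}{m}}$$

is essentially a function of the ratio $m/h$.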

The only unknown parameter at this point is the capacity $h$, known as the VC-dimension. This parameter is very important because it is related to the architecture of the network. The capacity of a feed-forward network corresponds to the number of units and the number of weights and thresholds (Baum et al. 89).

The size $m$ of the training set will normally be limited by the problem domain or by the supervisor, in order to learn something in a reasonable time. For a fixed number $m$, the learning error will decrease monotonically as the capacity $h$ increases. Unfortunately the confidence interval of equation (4) is a monotonically increasing function of $h$ at a fixed $m$. This indicates that there will be an optimal $h_{opt}$ where the generalization error has a minimum. If an optimal capacity $h_{opt}$ of the network could be found, the number of units and the number of weights and thresholds would be known. So far nothing has been said about the values of these parameters. It seems likely that there will be several weight constellations $W_i$ from W that could implement the desired function. This corresponds to the curve-fitting situation, where it is known that an n-th order polynomial will to some extent fit points from an N-th order polynomial, where $N > n$. It would be possible to find several coefficient constellations that would be equally valid. This was confirmed by Denker et al. (1988) in what they call a perturbation analysis. They took a well working network ($E_L = 0$), perturbed it by moving the weights to a new point in the weight space, and re-trained it. They found that the network was quite able to re-solve the task, returning to $E_L = 0$, but did not do so by undoing the perturbation. In fact, it moved in some other direction and settled on a new point $W_i$ in the weight space.
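The growth of the confidence interval with the capacity at a fixed training set size can be checked numerically. The sketch below assumes the form of the bound given in Vapnik (1992), namely C = 2 C0^2 (1 + sqrt(1 + E_L / C0^2)) with C0^2 = (h (ln(2m/h) + 1) - ln(eta)) / m; the function names and the sample values of m, h and eta are illustrative choices, not values from the paper.

```python
import math


def c0_squared(m, h, eta):
    """C0^2 for m training patterns, capacity h and accuracy parameter eta."""
    return (h * (math.log(2.0 * m / h) + 1.0) - math.log(eta)) / m


def confidence_interval(m, h, e_l, eta):
    """Confidence term added to the learning error e_l (assumed form)."""
    c2 = c0_squared(m, h, eta)
    return 2.0 * c2 * (1.0 + math.sqrt(1.0 + e_l / c2))


# Fixed training set size, growing capacity (illustrative values only).
m, eta = 1000, 0.05
for h in (10, 50, 200):
    print(h, round(confidence_interval(m, h, e_l=0.0, eta=eta), 4))
```

Under these assumptions the printed term grows with h, which is the behaviour that makes an intermediate capacity optimal once the decreasing learning error is added.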

A class of experimental methods used to find an $h$ close to the optimal $h_{opt}$ is known as pruning and will be described in the next section.

3  The  theory  and  ideas  behind  pruning 

The main idea behind all pruning methods is to keep the learning error $E_L(W)$ for a well working feed-forward network as low as possible and at the same time to reduce the complexity, i.e. the number of weights.

One of the simplest methods for reducing the complexity in a neural network is called Magnitude Based pruning (MAG). The method is based on the idea of eliminating the weakest connection, i.e. the weight with the smallest magnitude. For the simplest MAG version the algorithm is: delete the weight with smallest magnitude and retrain the network, repeat until a certain stop criterion is fulfilled. There are several more or less sophisticated versions of this method. A widely used form is weight decay (Hertz et al.), which gradually decreases the magnitude of the weights during training (not necessarily eliminating any weights).
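The simplest MAG loop can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the weights are treated as a flat vector, and `retrain` and `stop` are hypothetical callbacks standing in for the training procedure and the stop criterion.

```python
import numpy as np


def magnitude_prune(weights, retrain, stop):
    """Simplest MAG loop: delete the smallest-magnitude weight, retrain, repeat.

    retrain(w, mask) and stop(w, mask) are hypothetical callbacks supplied by
    the caller; they stand in for the training procedure and stop criterion.
    """
    w = np.asarray(weights, dtype=float).ravel().copy()
    mask = np.ones(w.shape, dtype=bool)           # True = weight still present
    while mask.any() and not stop(w, mask):
        live = np.flatnonzero(mask)               # indices of surviving weights
        victim = live[np.argmin(np.abs(w[live]))]
        mask[victim] = False                      # eliminate the weakest connection
        w[victim] = 0.0
        w = retrain(w, mask)                      # retrain the remaining weights
    return w, mask


# Toy usage: "retraining" only re-applies the mask, and pruning stops
# once two weights are left. Both choices are placeholders.
w, mask = magnitude_prune(
    [0.5, -0.05, 1.2, 0.3],
    retrain=lambda w, mask: w * mask,
    stop=lambda w, mask: mask.sum() <= 2,
)
print(w, mask)
```

In a real setting `retrain` would run the learning algorithm while holding the pruned weights at zero, and `stop` would typically monitor the learning error.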

The  next  two  methods  called  Optimal  Brain  Damage  (OBD)  and  Optimal  Brain  Surgeon  (OBS) 
have  a  more  mathematical  approach  using  information  from  the  second  order  derivatives  of  the 
error  function  to  perform  pruning.  They  both  use  the  Taylor  expansion  to  express  an  estimate  of 
how  the  training  error  will  change  as  the  weights  are  perturbed. 

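$$\delta E_L \approx \sum_i G_i \, \delta w_i + \frac{1}{2} \sum_i H_{ii} \, \delta w_i^2 + \frac{1}{2} \sum_{i \neq j} H_{ij} \, \delta w_i \, \delta w_j + O(\|\delta W\|^3)$$

where $G_i = \partial E_L / \partial w_i$ is the gradient and $H_{ij} = \partial^2 E_L / \partial w_i \partial w_j$ is the Hessian.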

Both methods assume that the network has been trained to a point where the gradient is zero, so that the first term in the equation can be neglected, and that the "quadratic" approximation (the cost function is nearly quadratic) holds, so that the last term in the equation can be neglected.

The OBD method additionally assumes that the $\delta E$ caused by deleting several parameters is the sum of the $\delta E$'s caused by deleting each parameter individually, so the off-diagonal part of the second order term is zero. The equation is then reduced to

$$\delta E_L = \frac{1}{2} \sum_i H_{ii} \, \delta w_i^2$$

The $\delta E_L$ term is called saliency and expresses the change in the cost function due to the elimination of weights. The algorithm is: delete the weight with smallest saliency and retrain the network, repeat until a certain stop criterion is fulfilled.
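As a small worked example, deleting weight i corresponds to a perturbation of -w_i, so under the diagonal approximation its saliency is (1/2) H_ii w_i^2. The sketch below assumes the diagonal of the Hessian is already available; the function name and the numbers are illustrative only.

```python
import numpy as np


def obd_saliencies(weights, hessian_diag):
    """Saliency of each weight under the diagonal (OBD) approximation."""
    w = np.asarray(weights, dtype=float)
    h = np.asarray(hessian_diag, dtype=float)
    return 0.5 * h * w ** 2


# Illustrative values only.
w = np.array([0.8, -0.1, 0.5])
h_diag = np.array([2.0, 4.0, 0.1])
s = obd_saliencies(w, h_diag)
print(s, int(np.argmin(s)))
```

In this toy case the weight with the smallest magnitude (index 1) is not the one with the smallest saliency (index 2), which is where OBD and MAG can make different choices.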

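The OBS does not assume that the off-diagonal part of the Hessian is zero. Instead it reformulates the goal. The elimination of $w_j$ can be expressed as $\delta w_j + w_j = 0$, or $\mathbf{e}_j^T \delta\mathbf{w} + w_j = 0$, where $\mathbf{e}_j$ is the unit vector in the weight space corresponding to the (scalar) weight $w_j$. The goal is then to solve:

$$\min_j \left\{ \min_{\delta\mathbf{w}} \frac{1}{2} \delta\mathbf{w}^T \mathbf{H} \, \delta\mathbf{w} \right\} \quad \text{such that} \quad \mathbf{e}_j^T \delta\mathbf{w} + w_j = 0$$

or, expressed in terms of a Lagrange multiplier $\lambda$,

$$\delta E_L = \frac{1}{2} \delta\mathbf{w}^T \mathbf{H} \, \delta\mathbf{w} + \lambda \left( \mathbf{e}_j^T \delta\mathbf{w} + w_j \right)$$

By taking functional derivatives the following equations appear:

$$\delta\mathbf{w} = -\frac{w_j}{[\mathbf{H}^{-1}]_{jj}} \mathbf{H}^{-1} \mathbf{e}_j \quad \text{and} \quad \delta E_L = \frac{1}{2} \frac{w_j^2}{[\mathbf{H}^{-1}]_{jj}}$$

The $\delta E_L$ term is again called saliency and expresses the change in the cost function due to the elimination of weights. The $\delta\mathbf{w}$ indicates how all weights should be adjusted after the elimination of a weight. This means that the network does not demand retraining. The only "learning" parameter involved is an $\alpha$, which comes from using a particular data vector and the Sherman-Morrison formula in the calculation of the inverse Hessian. The algorithm is: delete the weight with smallest saliency and adjust the other weights according to $\delta\mathbf{w}$, repeat until a certain stop criterion is satisfied.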
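One OBS step can be illustrated numerically using the standard closed-form result of Hassibi et al., saliency w_j^2 / (2 [H^-1]_jj) and adjustment -(w_j / [H^-1]_jj) H^-1 e_j. The sketch makes two simplifying assumptions: the Hessian is small enough to invert directly with `np.linalg.inv` (the Sherman-Morrison recursion mentioned above is not reproduced), and the matrix and weights are made-up values.

```python
import numpy as np


def obs_step(w, h_inv):
    """Pick the weight with the smallest OBS saliency and the matching update.

    w:     current weight vector
    h_inv: inverse Hessian of the error at w
    Returns (index j, saliency of j, adjustment delta_w for all weights).
    """
    w = np.asarray(w, dtype=float)
    h_inv = np.asarray(h_inv, dtype=float)
    saliency = w ** 2 / (2.0 * np.diag(h_inv))
    j = int(np.argmin(saliency))
    delta_w = -(w[j] / h_inv[j, j]) * h_inv[:, j]   # H^-1 e_j is column j
    return j, saliency[j], delta_w


# Illustrative 2-weight example.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
w = np.array([1.0, 0.2])
j, s, dw = obs_step(w, np.linalg.inv(H))
new_w = w + dw
print(j, s, new_w)
```

After the update the selected weight is exactly zero and the remaining weight has been shifted to compensate, which is why no retraining is required; the quadratic form (1/2) dw^T H dw evaluated at the returned adjustment equals the reported saliency.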


4  Test  and  experiments 

The three methods were tested on the MONK's problems (Thrun et al. 1991). Thrun et al. designed 3 fully connected networks trained by backpropagation with weight decay (BPWD) that outperformed all other approaches (network and rule-based) on these problems in an extensive machine learning competition. The goal here was to find how many weights could be eliminated by the different methods while still performing as well as the networks of Thrun et al. The results from these experiments are shown in table 1.

Comment to MONK 1: The MAG used a standard backpropagation with weight decay with learning rate $\eta = 0.1$ and decay rate $\gamma = 0.00001$. It needed only 3 epochs to perform as well as the OBS and was better with 22 epochs. The OBD used the same learning procedure as MAG with learning rate $\eta = 0.1$ and decay rate $\gamma = 0.0001$, but needed 300 epochs to perform as well as the MAG. Both methods were, however, highly sensitive to changes in the learning parameters (epochs, $\eta$, $\gamma$). This is contrary to OBS, which showed the same performance as long as $\alpha$ was kept
between  10"^  and  10“^. 

Comment to MONK 2 and MONK 3: The learning parameters for MAG and OBD were respectively: epochs = 10, $\eta$ = 0.1 and $\gamma$ = 10~® and epochs = 100, $\eta$ = 0.1 and $\gamma$ = lO"*. Other combinations of learning parameters were tried but the best performance was obtained with the above-mentioned parameters. The result with these parameters is shown in table 1. It will be possible to optimize their performance by trying other parameter combinations.

Table 1: Number of weights needed for MAG, OBD and OBS to make the same performance as the Backpropagation with weight decay (BPWD) found by Thrun et al. (1991), on the MONK's problems.

Many similar tests were made on these MONK's problems given new start conditions. The new start conditions were created by training the original network from random weights, so the weight start position of the test network would differ from test to test.

The result was the same: all methods were capable of reducing the number of weights, and none of the methods was the best every time. By adjusting the parameters in the learning algorithm, MAG and OBD performed as well as OBS and sometimes better. However, OBS was by far the most robust, stable and fastest method.

5  Discussion  and  Conclusion 

In section 2 it was shown that there could be many possible weight constellations (points in the weight space) that would yield equally valid generalizations for an optimal or nearly optimal capacity.

Although OBD and OBS have established a mathematical foundation to get a good estimate of the saliency, they still do not make a quantitative statement of how well they improve the generalization. From equation (3) it can be justified that they will improve the generalization. The same is also true for the magnitude based method. So it seems there is no theoretical evidence that one method will improve the generalization more than the other. Experimentally this was confirmed by showing that there were several optimal solutions to the same problem.

But the experiments also showed that OBS was by far the most stable and robust. If it was not the best, it was always close to the best. OBS works by staying close to the local minimum from which it begins the pruning: it removes one weight and then projects the error surface onto a space one dimension smaller than the previous one. This means that it will stay at the same local minimum of the error surface. This will work very well as long as the local minimum from where OBS starts is close to the global minimum. But if the original (start) minimum is not the smallest local minimum, OBS will never find that point because it does not retrain. Both OBD and MAG will be able to find such a new local minimum, depending on their learning algorithm. Hassibi et al. (1992) have shown through an example of a 5 node solution to the XOR problem that only OBS will always be able to remove the right weight, while both OBD and especially MAG often fail to do so. The reason OBS works perfectly here is that the minimum error at the start is equal to the global minimum. This is often the case for smaller networks and OBS will probably be the best in these cases.

The general problem with the strategies used by MAG and OBD is that they are one step predictions, which means that they take one "optimal" step, recalculate the conditions and take another "optimal" step. There are no guarantees that these two steps together are optimal. How optimal the steps are all together will depend on the recalculations of the conditions, which are directly dependent on the learning algorithm and its parameters. These methods are therefore unstable. This explains how both MAG and OBD were able to perform better than OBS on the MONK 1 problem while OBS in general is more stable than the others.

To choose among these different methods, considerations of the memory requirement, computing time, etc. should be made. For real life applications, where the complexity might be very large, the following pruning scheme seems reasonable: start with MAG until the network has a size that allows the use of OBS, and spend some effort on finding a "good" local minimum. When such a minimum is found, OBS should be used to do the rest of the pruning. When OBS stops, the network should be retrained in order to get 0 and OBS should be started again.

Abstract

Using the important property that the outputs of a trained multilayer perceptron approximate the a posteriori probability functions of the classes, we propose a technique for the implementation of sequential classification by perceptron and multilayer perceptron, and its application to node growing in the number of input nodes of the perceptron and the number of hidden nodes of the multilayer perceptron. A measurement for the ordering of the hidden nodes of the trained multilayer perceptron is also proposed. The ordering of the hidden nodes comes from the contribution of each hidden node. Using the node growing technique, the minimum number of hidden nodes can be obtained in the training and used in the classification. The technique can also be applied to the single layer perceptron. In the experiment, the typical "XOR" problem is used. The balance between the reduction of hidden nodes and the classification results is quite good.


Introduction 

When performing the back-propagation algorithm on a multi-layer perceptron, the number of layers and the number of hidden nodes in the layers have to be determined. Related papers [1] have shown that feed-forward networks with one hidden layer are capable of accurate approximation to an arbitrary mapping provided that sufficient hidden nodes are available. So a two-layer perceptron is used in this study. Too many hidden nodes in the two-layer perceptron may take longer computation time. Too few hidden nodes may not solve the problem. As for how many nodes should be used in the hidden layer, no absolute criteria can be followed. It seems to depend on the problems to be confronted.

One feasible method of obtaining a neural network with an appropriate number of hidden nodes for a particular problem is to start with a larger net, then prune it to the desired size. Many previous research works have mentioned it [2]. Such a smaller net, owing to the reduction of synaptic connections, is more efficient in both forward computations and learning.

Sequential classification (SC) is quite important in statistical pattern recognition. Its application to pattern classification was mentioned by Fu [3], and could be employed widely in a number of fields such as industrial processes and biomedical diagnosis. The property of keeping the balance between the misclassification error and the cost of feature measurements makes SC a feasible method with practical importance and theoretical interest. By taking feature measurements sequentially and terminating the sequential process (making a decision) when the proper stopping criterion is achieved, a desirable accuracy of classification can be obtained and the cost of taking feature measurements is also acceptable. In this study, SC schemes applied to back-propagation trained single-layer and two-layer perceptrons are proposed for dynamically pruning the network. The number of hidden nodes can be reduced. The minimum number of hidden nodes can also be determined.

Sequential Classification by Two-layer Perceptron

In this study, we adopt an important key property: the outputs of the output nodes of the multilayer perceptron approximate the a posteriori probability functions of the classes being trained [4]. It can be seen that

$$O_i \approx P(\omega_i \mid X) \qquad (1)$$

Here, $O_1$ and $O_2$ are the outputs of the first and second output nodes, $\omega_1$ and $\omega_2$ are the classes 1 and 2, and $X$ is the input vector.

There are m output nodes in the output layer to denote the classes. At the n-th stage of the process, the vector H of the hidden nodes is $\langle h_1, h_2, \ldots, h_n \rangle$. The graph representation of the n-th step in this sequential classification process is presented in Figure 1.

[Figure 1: network diagram showing the output layer (nodes 1, 2, ..., m), the hidden layer (nodes h_1, ...) and the input layer (nodes 1, 2, ..., p).]

Figure 1. The n-th step in this sequential classification process.


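The generalized sequential probability ratio of taking n hidden nodes is computed as follows:

$$U_n(\omega_i \mid X) = \frac{P_n(X \mid \omega_i)}{\left[ \prod_{k=1}^{m} P_n(X \mid \omega_k) \right]^{1/m}} = \frac{P_n(\omega_i \mid X) P(X) / P(\omega_i)}{\left[ \prod_{k=1}^{m} P_n(\omega_k \mid X) P(X) / P(\omega_k) \right]^{1/m}} \qquad (2)$$

Consider that the sigmoidal function is the activation function; then

$$U_n(\omega_i \mid X) = \frac{\dfrac{1}{P(\omega_i)} \cdot \dfrac{1}{1 + \exp\left(-\sum_{j=1}^{n} W_{ji} h_j - \theta_i\right)}}{\left[ \prod_{k=1}^{m} \dfrac{1}{P(\omega_k)} \cdot \dfrac{1}{1 + \exp\left(-\sum_{j=1}^{n} W_{jk} h_j - \theta_k\right)} \right]^{1/m}} \qquad (3)$$

where $W_{jk}$ is the weight from the j-th hidden node to the k-th output node, $h_j$ is the output of the j-th hidden node, and $\theta_k$ is the bias of the k-th output node. The modified stopping boundary is taken as

$$g_s(n) = \log \, [\ldots] \qquad (4)$$

Then $\log U_n(\omega_s \mid X)$ is compared with the stopping boundary of the s-th pattern class, $g_s(n)$. If

$$\log U_n(\omega_s \mid X) < g_s(n), \quad s = 1, 2, \ldots, m \qquad (5)$$

then X is not considered in the class $\omega_s$. Repeat the rejection condition until one class is left. Then the pattern is assigned to that class.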

Algorithm 1

Sequential classification with growing of hidden nodes for m-class problem on two-layer network

Input: A back-propagation "trained" two-layer network with ordered hidden nodes and a set of "testing" patterns.

Output: The classification result of every testing pattern by sequential classification of MLP.

Step 1. Present an input pattern.
    Present an input vector X in the input layer.

Step 2. Set the selected number of hidden nodes starting from one.
    n = 1, where n is the number of the selected hidden nodes.

Step 3. Calculate the computed outputs through the selected hidden nodes.
    Calculate the computed output of the k-th output node from the n selected hidden nodes:

    $$O_k = \frac{1}{1 + \exp\left(-\sum_{j=1}^{n} W_{jk} h_j - \theta_k\right)}$$

    where $W_{jk}$ is the weight from the j-th hidden node to the k-th output node and $\theta_k$ is the bias of the k-th output node.

Step 4. If a sufficient or desirable accuracy classification is not achieved, add one hidden node and go to step 3.
    Set class s = 1.
    Repeat:
        Calculate the sequential probability ratio.
        If $U_n(\omega_s \mid X) > g_s(n)$, go to step 5.
        Else s = s + 1.
    Add one hidden node, n = n + 1, and go to step 3.

Step 5. Repeat by going to Step 1.
    Go to step 1 until all the testing patterns are classified.
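The control flow of Algorithm 1 for a single test pattern can be sketched as below. This is a simplified illustration under loudly stated assumptions, not the authors' implementation: the class priors are taken as equal (so they cancel in the ratio), the class-dependent boundary g_s(n) is replaced by one constant `boundary` in the log domain, and the fallback used when all hidden nodes are exhausted is an added convention. All names and numbers are hypothetical.

```python
import numpy as np


def sequential_classify(h, W, theta, boundary):
    """Grow the number of hidden nodes until one class passes the boundary.

    h:        outputs of the ordered hidden nodes, shape (H,)
    W:        hidden-to-output weights, shape (H, m)
    theta:    output biases, shape (m,)
    boundary: constant log-domain stopping boundary (an assumption; the
              paper uses a class-dependent boundary g_s(n))
    Returns (class index, number of hidden nodes used).
    """
    h, W, theta = (np.asarray(a, dtype=float) for a in (h, W, theta))
    n_hidden, m = W.shape
    for n in range(1, n_hidden + 1):
        # Step 3: outputs computed from the first n hidden nodes only.
        o = 1.0 / (1.0 + np.exp(-(h[:n] @ W[:n]) - theta))
        # Log ratio of each output to the geometric mean of all outputs
        # (equal class priors assumed, so the priors cancel).
        log_u = np.log(o) - np.mean(np.log(o))
        # Step 4: accept the first class whose ratio exceeds the boundary.
        for s in range(m):
            if log_u[s] > boundary:
                return s, n
    # Fallback (assumption): all nodes used, take the most probable class.
    return int(np.argmax(o)), n_hidden


# Toy pattern: the first hidden node is weak, the second is decisive.
result = sequential_classify(
    h=[1.0, 1.0],
    W=[[0.1, -0.1], [3.0, -3.0]],
    theta=[0.0, 0.0],
    boundary=0.5,
)
print(result)
```

A looser boundary lets the process stop after the first hidden node, which is the mechanism by which the average number of used hidden nodes drops below the network size.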


Ordering  of  Hidden  Nodes  on  Neural  Networks 

In this study, differing from the K-L expansion method in feature ordering, a technique based on the property of neural networks is used. The hidden nodes of the two-layer perceptron essentially behave as general feature extractors, except for their non-linearities. The underlying concept is that a feature measurement is more important if it can separate more training samples. Hence the classification ability of a node can be measured by the separability of the training samples between their desired class and the others, considering only that node in the layer.

For a two-layer perceptron in the m-class case, consider that the activation function is a sigmoidal function, and that the j-th hidden node has the classification error $\varphi_{jk}^{s}$ at the k-th output node for the s-th learning sample:

$$\varphi_{jk}^{s} = D_k - \frac{1}{1 + \exp(-W_{jk} O_j + \theta_k)} \qquad (6)$$

where $D_k$ is the desired output of the training sample for the k-th output node, $O_j$ is the output of the j-th hidden node, $W_{jk}$ is the weight from the j-th hidden node to the k-th output node, and $\theta_k$ is the bias (threshold) of the k-th output node. From (6), the j-th hidden node contributes the total mean square error $\gamma_j$ for all m output nodes. Over all n training samples, $\gamma_j$ is

$$\gamma_j = \frac{1}{m \times n} \sum_{s=1}^{n} \sum_{k=1}^{m} \left( \varphi_{jk}^{s} \right)^2 \qquad (7)$$

Sort the nodes and let $\gamma_1 > \gamma_2 > \cdots > \gamma_h$, where h is the number of hidden nodes.

A node with small mean square error indicates that it is helpful to correct classification, while one with large mean square error shows the contrary. In the sequential classification process, the nodes should be taken according to the ascending order of their mean square error for the purpose of terminating the process earlier.
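The ordering measure can be sketched as follows, reading equations (6) and (7) as a per-node squared error averaged over all training samples and output nodes. The array layout, the function name and the toy data are assumptions made for illustration; the sign convention of the threshold follows equation (6) as printed.

```python
import numpy as np


def order_hidden_nodes(hidden_out, W, theta, desired):
    """Mean square error of each hidden node taken alone, and the node order.

    hidden_out: hidden node outputs O_j per training sample, shape (n, h)
    W:          hidden-to-output weights W_jk, shape (h, m)
    theta:      output thresholds, shape (m,)
    desired:    desired outputs D_k per training sample, shape (n, m)
    Returns (gamma, order) where order lists nodes by ascending gamma.
    """
    hidden_out, W, theta, desired = (
        np.asarray(a, dtype=float) for a in (hidden_out, W, theta, desired)
    )
    # Contribution of node j alone to output k for sample s: shape (n, h, m).
    z = hidden_out[:, :, None] * W[None, :, :]
    out = 1.0 / (1.0 + np.exp(-z + theta[None, None, :]))
    phi = desired[:, None, :] - out              # equation (6)
    gamma = np.mean(phi ** 2, axis=(0, 2))       # equation (7)
    return gamma, np.argsort(gamma)


# Toy data: node 0 follows the target, node 1 carries no information.
gamma, order = order_hidden_nodes(
    hidden_out=[[1.0, 0.0], [0.0, 1.0]],
    W=[[4.0], [0.0]],
    theta=[0.0],
    desired=[[1.0], [0.0]],
)
print(gamma, order)
```

The returned order places the most helpful node first, so the sequential process can take nodes in that order and terminate early.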


Experiments 

The experiment is the classification of the "XOR" problem. We use four training patterns to train this two-layer network. After back-propagation training, we take 100×100 = 10,000 testing patterns for classification. They form a square in a two-dimensional space with their X and Y coordinates ranging from (-0.5, -0.5) to (1.5, 1.5) in increments of 0.02. Through the SPRT procedure of the sequential classification, the used hidden nodes of the 2-3-2 two-layer perceptron are shown in Table 1. The classification results for the 2-3-2, 2-4-2, 2-5-2, 2-6-2, 2-7-2, 2-8-2 and 2-9-2 two-layer perceptrons are shown in Figure 2. The overall reduction of the net size can be seen in Table 2.


Total No. of testing samples: 10000

No. of hidden nodes used in the SPRT procedure       1       2       3
No. of classified samples with used hidden nodes     4802    3685    1513
Average No. of used hidden nodes                     1.6711

Table 1. Used hidden nodes for the 2-3-2 two-layer perceptron with sorted hidden nodes.

Figure 2. The classification results of the "XOR" problem. (a) shows the classification result by using all the hidden nodes for each network, (b) is the classification result of taking unsorted hidden nodes by SC, (c) is the classification result of taking sorted hidden nodes by SC.


No. of hidden nodes used in the SPRT process            3      4      5      6      7      8      9
Average No. of used hidden nodes in the SPRT process    1.67   1.70   1.67   1.67   1.68   1.98   1.81
Reduced rate of network                                 44.3%  57.5%  66.6%  72.2%  76%    75.3%  79.9%

Table 2. Net pruning results for the two-layer perceptron with sorted hidden nodes.
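The reduced rates in the table appear consistent with one minus the ratio of the average number of used hidden nodes to the number of hidden nodes in the network. This relation is inferred from the reported numbers and is not stated in the text; the short check below, with hypothetical names, recomputes it for the 2-3-2 through 2-9-2 networks.

```python
def reduced_rate(avg_used, total_hidden):
    """Percentage of hidden nodes saved on average (inferred relation)."""
    return 100.0 * (1.0 - avg_used / total_hidden)


# Average number of used hidden nodes reported for each network size.
avg_used = {3: 1.67, 4: 1.70, 5: 1.67, 6: 1.67, 7: 1.68, 8: 1.98, 9: 1.81}
for total, avg in avg_used.items():
    print(total, round(reduced_rate(avg, total), 1))
```

Because the averages are rounded to two decimals, the recomputed percentages can differ from the reported ones in the last digit.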


Conclusions  and  Discussion 


In this paper, the implementation of sequential classification by perceptron and multilayer perceptron is proposed and an efficient net pruning effect is achieved. An important key property is adopted in the derivation of sequential classification. This property is that the outputs of the multilayer perceptron approximate the a posteriori probability functions of the classes being trained. The formula for the ordering of hidden nodes of the multilayer perceptron or input nodes of the perceptron is proposed.

In the experiments, our method gets a good result in pruning the multi-layer network. Pruning the 2-4-2, 2-5-2, 2-6-2 and 2-7-2 two-layer networks to 2-2-2 two-layer networks gives the minimum number of hidden nodes used in the classification of the "XOR" problem, which is the same as the derivation of Mirchandani and Cao [5].

The  proposed  algorithm  can  be  ^iplied  in  the  single  layer  perceptron  with  2-cIass  and  multi-class 
problems.  If  the  numbw  of  hidden  layers  is  more  than  1,  then  our  proposed  technique  may  also  be 
iqiplied.  The  pruning  procedures  start  fir^  tbe  hidden  layer  close  to  the  output  and  sequentially  prune  the 
hidden  layer  backward  to  die  input 
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Abstract 

We  define  layered  networks  with  horizontal  connections  as  networks  having  units  that 
receive  inputs  from  the  lower  layer  and  also  from  the  previous  units  of  the  same  layer. 
We  show  that  architecture  with  horizontal  connections  does  not  require  as  many  units 
in  the  hidden  layers  as  the  plain  layered  architecture  in  order  to  approximate  a  function. 

1.  Introduction 

Performance  of  a  learning  network  depends  both  on  the  architecture  and  the  algorithm 
of  the  network.  For  example,  the  Cascade  Correlation  (Cascor)  algorithm  performs 
better  for  both  the  two  spiral  problem  [Fahlman  1991]  and  simple  applications  [Blonda 
1993]  than  does  the  Backpropagation  algorithm.  Cascor  is  an  example  of  the  multilayer 
architecture  with  horizontal  connections  and  an  incremental  algorithm.  One  reason  for 
the  superior  performance  of  this  architecture  is  the  ability  to  capture  regions  where  the 
function being modeled is constant with fewer units. In this paper we demonstrate
that architectures with horizontal connections require fewer units compared with the
plain multilayer architecture.

The multilayer perceptron network with k units in a single hidden layer computes
functions

$$f(x) = \sum_{j=1}^{k} w_j H(a_j \cdot x - c_j) \qquad (1)$$

This network was proved [Leshno et al. 1993] to be a universal approximator. The
function (1) can approximate any continuous function $f: [0,1]^n \to R$ provided that H
is a non-polynomial activation function satisfying mild conditions. Another architecture
[Blum 1991], a constructive one, requires a three-layer network. We use Blum's
technique to demonstrate that nets with two hidden layers and with horizontal
connections require fewer units.
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2.  Approximation  by  a  network  with  two  hidden  layers 

Approximation capabilities of two-hidden-layer networks were studied and estimates of
rates of approximation were derived [Blum 1991]. For simplicity, let us assume that
$f \in L^1([0,1]^n)$. Piecewise constant functions on rectangular partitions of $[0,1]^n$ are dense
in $L^1([0,1]^n)$. A function $f$ can be approximated by a piecewise constant
function $\tilde f$. This function can be modeled by a two-hidden-layer network

$$\tilde f = \sum_{B} w_B I_B \qquad (2)$$

where the summation is over all rectangular boxes and

$$I_B = H\Big[\sum_{i=1}^{n} \big(H(x_i - a_i) + H(b_i - x_i)\big) - 2n + 0.5\Big] \qquad (3)$$

is the indicator function (one hidden-layer network) for an n-dimensional rectangular
box ($a_i < b_i$, $a_i < x_i < b_i$, $1 \le i \le n$) of a rectangular partition of $[0,1]^n$. H is the
left-continuous Heaviside function, $H(z) = 1$ for $z > 0$, $H(z) = 0$ for $z \le 0$.
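The box indicator can be checked numerically. The sketch below assumes the form I = H[sum_i (H(x_i - a_i) + H(b_i - x_i)) - 2n + 0.5] with the left-continuous Heaviside function; the function names are illustrative and not taken from the paper.

```python
import numpy as np


def heaviside(z):
    """Left-continuous Heaviside step: 1 for z > 0, 0 for z <= 0."""
    return (np.asarray(z) > 0).astype(float)


def box_indicator(x, a, b):
    """Indicator of the open box a_i < x_i < b_i, built from threshold units."""
    x, a, b = (np.asarray(v, dtype=float) for v in (x, a, b))
    n = x.shape[-1]
    # First hidden layer: 2n threshold units, two per coordinate.
    first_layer = heaviside(x - a) + heaviside(b - x)
    # Second hidden layer: fires only if all 2n first-layer units fire.
    return heaviside(first_layer.sum(axis=-1) - 2 * n + 0.5)


print(box_indicator([0.3, 0.3], [0.25, 0.25], [0.5, 0.5]))  # point inside the box
print(box_indicator([0.6, 0.3], [0.25, 0.25], [0.5, 0.5]))  # point outside the box
```

Inside the box all 2n first-layer units fire, so the argument of the outer step is 0.5; outside, at least one unit is silent and the argument is at most -0.5.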

Note that the activation function H can be a general sigmoid $\sigma$. Blum's
approximation results and the results that follow for Heaviside functions hold for
general sigmoids, $\sigma(z) \to 1$ as $z \to \infty$ and $\sigma(z) \to 0$ as $z \to -\infty$.

This stems from the fact that a general rescaled sigmoid converges to a Heaviside
function,

$$\sigma(nz) \to 1 \text{ for } z > 0, \quad 0 \text{ for } z < 0, \quad \sigma(0) \text{ for } z = 0 \quad \text{as } n \to \infty$$

and consequently $\|\sigma(nz) - H(z)\|_{L^1} \to 0$.
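The convergence of a rescaled sigmoid to the step function can be illustrated numerically. The sketch below assumes the logistic sigmoid and measures the L1 distance on the interval [-1, 1] by a simple Riemann sum; both choices are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid (one possible choice of general sigmoid)."""
    return 1.0 / (1.0 + np.exp(-z))


def l1_distance_to_step(n, num_points=200001):
    """Approximate the L1 distance on [-1, 1] between sigmoid(n z) and H(z)."""
    z = np.linspace(-1.0, 1.0, num_points)
    step = (z > 0).astype(float)          # left-continuous Heaviside function
    gap = np.abs(sigmoid(n * z) - step)
    # Mean value times interval length approximates the integral.
    return gap.mean() * 2.0


for n in (1, 10, 100):
    print(n, l1_distance_to_step(n))
```

For the logistic sigmoid the distance shrinks roughly in proportion to 1/n, which is consistent with the limit stated above.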

3.  Layered  networks  with  horizontal  connections 

Definition. Networks with horizontal connections have units that receive inputs from
the lower (input) layer and also from the previous units of the same layer

$$\sigma_k = \sigma\Big(a_k \cdot x - b_k + \sum_{j<k} w_{kj}\, \sigma_j\Big) \qquad (4)$$

Note that a network with horizontal connections suggests an incremental algorithm. In
the (n+1)st iteration a new unit is added and all coefficients are calculated

and the other coefficients associated with the first n units are kept unchanged. Examples
of incremental algorithms are the Cascade Correlation [Fahlman 1990] and the projection
pursuit algorithm [Jones 1992].

In the two-dimensional case, n=2, the function

is the indicator function of a set which is a union of vertical rectangular strips, fig.1.
Similarly, the function
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is  the  indicator  function  of  a  set  which  is  a  union  of  horizontal  rectangular  strips.  By 
replacing units

$$H(x_i - a_i) + H(b_i - x_i) \qquad (8)$$

with (6),(7), the indicator functions of a piecewise rectangular region, we obtain a two
layer network. This network can model non-convex figures, e.g. a semicircle, fig. 2b.

4.  Estimate  for  the  upper  bound  of  units  needed  in  three-layer  networks  with 
horizontal  connections 

Following Blum (1991) let us assume that we have a square mesh of size 1/m for
dimension n=2. Then we need $m^2$ units (3) in the second hidden layer and 4(m+1) units
(8) in the first hidden layer of the plain multilayer net. The total number of units for a
two-hidden-layer network is $m^2 + 4(m+1)$.

Some  piecewise  constant  functions  can  be  conveniently  implemented  by  units  with 
horizontal  connections.  To  illustrate  that  a  net  with  horizontal  connections  can  reduce 
the  requirement  on  the  number  of  units  in  the  second  hidden  layer,  let  us  consider  the 
indicator  function  of  a  2-dimensional  chessboard  pattern  (black=l,  white=0,  with  an 
even  number  of  rows),  fig.2.  Two  successive  lines  of  squares  can  be  implemented  by 
two  units  (each  with  one  hidden  layer),  e.g.  the  indicator  function  of  the  first  two  lines 
of  the  chessboard  (a  'zigzag'  function)  can  be  defined  as  follows 


/o  =  H,(x2  -  (1  /  -k/m)) 

bi=0,  b^=l,for  k  even,  b^=-l,for  k  odd 

7^.0,  =«[7.  + A -2  +  0-5] 

Similarly we can construct other zigzag indicator functions.
This process requires two units per zigzag, or m units for the second layer, rather than
$m^2$ units when we model (using plain units of type (3)) one square at a time. The
two-hidden-layer network for the chessboard requires only m + 4(m+1) units. Similarly we
can construct two-hidden-layer networks, with fewer units, computing indicator
functions for other geometric figures, e.g. semicircles, spirals.
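The saving can be made concrete with a small calculation. The sketch below assumes the unit counts m^2 + 4(m+1) for the plain two-hidden-layer network and m + 4(m+1) for the chessboard network with horizontal connections, on a square mesh of size 1/m in two dimensions; the function names are illustrative.

```python
def plain_units(m):
    """Units of a plain two-hidden-layer net: one second-layer unit per square."""
    return m**2 + 4 * (m + 1)


def horizontal_units(m):
    """Units for the chessboard when zigzag units with horizontal connections are used."""
    return m + 4 * (m + 1)


for m in (4, 8, 16):
    print(m, plain_units(m), horizontal_units(m))
```

For an 8 by 8 chessboard this gives 100 units for the plain network against 44 with horizontal connections, and the gap grows quadratically with m.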

We can reduce the number of required units for any nonconvex indicator function:
Theorem. A function $f \in L^1([0,1]^n)$ which is constant on a nonconvex measurable
subset S can be represented by a two-hidden-layer network with horizontal
connections. This network has fewer units in the second hidden layer than the plain
network (2).
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Proof: The function $f$ can be approximated with arbitrary precision by a piecewise
constant function $\tilde f$, where the nonconvex set S is approximated by a piecewise
rectangular region, $\tilde S$. The boundary of $\tilde S$ consists of at least one L-shaped set of 3
rectangles (fig. 6) or of the saddle type pairs of rectangles (fig. 3) where the function is
constant across the L-shaped set. A one-hidden-layer unit with horizontal connections
can represent the indicator function of an entire set of four rectangles (either fig. 3 or fig.
6) whereas with a plain architecture, 4 units (one for each rectangle) are required. The
indicator function of an L-shaped set of 3 rectangles is represented by

$$I_L = H\big[x_1 - b_1 + b_1 H(b_2 - x_2)\big].$$
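A quick numerical check of an L-shaped indicator with one horizontal connection follows. It assumes the expression H[x1 - b1 + b1 H(b2 - x2)] on the unit square with the left-continuous Heaviside function; the thresholds b1 = b2 = 0.5 are illustrative values, not taken from the paper.

```python
def H(z):
    """Left-continuous Heaviside step: 1 for z > 0, else 0."""
    return 1.0 if z > 0 else 0.0


def l_shape(x1, x2, b1=0.5, b2=0.5):
    """Indicator of the region {x2 < b2} or {x1 > b1} for points with 0 < x1 <= 1."""
    lower = H(b2 - x2)               # earlier unit: fires below the line x2 = b2
    return H(x1 - b1 + b1 * lower)   # later unit: its threshold on x1 drops to 0 when `lower` fires


for point in [(0.2, 0.2), (0.8, 0.2), (0.8, 0.8), (0.2, 0.8)]:
    print(point, l_shape(*point))
```

Three of the four quadrants are switched on and only the upper-left one stays off, so two threshold units cover three rectangles.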

The  saddle-type  representation  is  shown  below. 

5.  Examples  of  two-layer  networks 

Blum  (1991)  showed  that  a  saddle  type  function  cannot  be  implemented  by  a  two  layer 
net  with  a  linear  output  unit.  However,  if  we  admit  horizontal  connections  we  can 
construct  a  one-hidden-layer  solution. 

5.1  XOR  example 

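An indicator function of a strip

$$f(x) = \begin{cases} 0 & \text{for } x_2 \le 0.25 \\ 1 & \text{for } 0.25 < x_2 \le 0.5 \\ 0 & \text{for } x_2 > 0.5 \end{cases}$$

is implemented by a two layer net,

$$f_{strip} = H\big(x_2 - b_1 - (1 - b_1) H(x_2 - b_2)\big), \quad b_1 = 0.25, \ b_2 = 0.5$$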

Rotation  and  slight  rescaling  of  this  strip  results  in  a  two  layer  net,  fig.4, 
fxor  -  -X2-Xi+  (2.5 -b^)H{x2  “ =  0.25,fc2  =  0.5 

that  separates  points  (0,0),  (1,1),  of  the  unit  square,  from  points  (0,1),  (1,0). 
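The strip and its rotated version can be tried out directly. In the sketch below, `strip` assumes the form H(x2 - b1 - (1 - b1) H(x2 - b2)) with b1 = 0.25 and b2 = 0.5. The function `xor_like` is a hypothetical diagonal strip -c < x2 - x1 <= c chosen only for illustration; its coefficients are not the ones used in the paper.

```python
def H(z):
    """Left-continuous Heaviside step: 1 for z > 0, else 0."""
    return 1.0 if z > 0 else 0.0


def strip(x2, b1=0.25, b2=0.5):
    """Indicator of b1 < x2 <= b2 for x2 in [0, 1], using one horizontal connection."""
    first = H(x2 - b2)                       # earlier unit in the same layer
    return H(x2 - b1 - (1.0 - b1) * first)   # later unit receives the earlier unit's output


def xor_like(x1, x2, c=0.5):
    """Hypothetical diagonal strip -c < x2 - x1 <= c (illustrative coefficients)."""
    u = x2 - x1
    first = H(u - c)
    return H(u + c - 2.0 * first)


for corner in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(corner, xor_like(*corner))
```

The diagonal strip contains the corners (0,0) and (1,1) and excludes (0,1) and (1,0), which is the separation required for XOR.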

5.2  Saddle  type  function  implementation 

Besides an XOR function, the two layer network with horizontal connections can
implement a saddle type function, fig.3,

$$f(x) = \begin{cases} 1 & \text{for } x_1 < 0.5, \ x_2 < 0.5 \\ 1 & \text{for } x_1 > 0.5, \ x_2 > 0.5 \\ 0 & \text{otherwise} \end{cases}$$

The net implementing this function has 4 units and a linear output unit

$$f_{saddle} = H\big(b - x_2 - b H(x_1 - a)\big) + H\big(x_2 - b - (1 - b) H(a - x_1)\big)$$
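The saddle network can be evaluated on the four quadrants of the unit square. The sketch assumes the form H(b - x2 - b H(x1 - a)) + H(x2 - b - (1 - b) H(a - x1)) with a = b = 0.5, the thresholds suggested by the piecewise definition, and the left-continuous Heaviside function.

```python
def H(z):
    """Left-continuous Heaviside step: 1 for z > 0, else 0."""
    return 1.0 if z > 0 else 0.0


def saddle(x1, x2, a=0.5, b=0.5):
    """Saddle type function on the unit square: four threshold units, linear output."""
    lower_left = H(b - x2 - b * H(x1 - a))            # on for x1 <= a, x2 < b
    upper_right = H(x2 - b - (1.0 - b) * H(a - x1))   # on for x1 >= a, x2 > b
    return lower_left + upper_right


for point in [(0.2, 0.2), (0.8, 0.8), (0.2, 0.8), (0.8, 0.2)]:
    print(point, saddle(*point))
```

Away from the lines x1 = a and x2 = b the output matches the piecewise definition; on those boundary lines the values can differ, which does not matter for approximation in the integral sense.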
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Fig.1 Rectangular strips

Fig.2 A chessboard function

Fig.2b A semicircle

Fig.3 The saddle type function net
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Principle and Methods of Structure Variation
in Feedforward Neural Networks
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Abstract: In the past several years, a number of techniques for training feedforward neural
networks have been presented. Many problems such as local minima and training speed have been
discussed and solved to some extent. However, generally speaking, gradient descent methods were used
in most of these papers. Since there exists complex nonlinearity in feedforward neural networks,
training by gradient descent methods alone is sometimes inefficient. In this paper, the principle and
methods of structure variation in feedforward neural networks are systematically presented. They also
have their basis in anatomy, similar to the Back Propagation algorithm, and consist of three parts:
network expansion, network construction and network compression. Each part contains several methods.
Here multilayer perceptrons are regarded as White Boxes whose weights and thresholds are controlled
by us. In contrast, traditional gradient descent methods treat multilayer perceptrons as Black
Boxes. It is appropriate to use the gradient descent algorithms and structure variation methods
alternately. This combined method has been discussed in several papers. In this paper it will be
studied comprehensively and systematically.

Key  Words:  Feedforward  neural  networks,  structure  variation,  dig  tunnels,  global  minima,  hidden 
neurons 


1.  Introduction 

The techniques of training feedforward neural networks have been studied and improved since the
resurgence of neural networks. A large number of efficient and powerful algorithms have been presented
to avoid local minima and speed up training in most cases. For example, among algorithms to
speed up learning, Weir (1991) analyzed the function of learning rates in Back Propagation and
improved the training speed by self-determination of adaptive learning rates. Shoemaker et al (1991)
discussed trinary quantization of weight updating. Among the techniques for avoiding local minima,
Wessels & Barnard (1992) showed how to choose initial weights carefully, and Hirose et al (1991) and
Tsaih (1992) added hidden neurons during training to escape from local minima. Among the methods
for avoiding local minima, simulated annealing (Kirkpatrick et al, 1983; Atkin et al, 1989;
Nakayama & Normura, 1992) is a fairly effective one, but it costs too much time. The homotopy
method (Yang & Yu, 1993) is promising; however, it trains the network many times and therefore
costs much time. In these improved techniques, gradient descent methods are mainly used, so that it
can be difficult to handle intricate problems such as the location of the global minimum at a "deep
and narrow well".

Since there exists complex nonlinearity in feedforward neural networks, it seems that it will be
inefficient always to use gradient descent techniques. Thus some researchers turned to changing the
structure of the network while training, as in the above mentioned method of adding hidden neurons
given by Hirose et al (1991) and Tsaih (1992). Nadel (1989) and Hoehfeld & Fahlman (1992) added
hidden neurons which are connected to the preceding hidden neurons and to the input layer, during
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training  or  construction. 

In this paper, we will summarize and develop the structure variation methods by dividing them
into three parts: network expansion, network construction and network compression. Due to space
limits, only our results about structure variation are briefly outlined in this paper (for details see
Liang (1993)).

The idea of structure variation has its anatomical basis. According to anatomy, the learning
process is not accomplished only by the variation of the intensity and polarity of connections among the
neurons. When learning becomes difficult, new neurons come into the biological network to aid
learning, i.e., the network expands. In fact, one neuron memorizing one pattern is a special case of
one neuron memorizing several patterns or several neurons memorizing several patterns, which is
similar to some construction processes. On the other hand, in the process of thinking, not only the
intensity and polarity of connections but also the structure of the biological network is always
changing in order to comprehend and master the knowledge, which leads to a compression of the network.
Network expansion means learning while network compression means digesting. These processes are
also similar to the learning process of human beings. For example, when we first learn a subject, we
feel that there is so much for us to learn, but several years later, we will feel that the subject has
narrowed down into a few key points. In summary, when we learn knowledge, not only the intensity and
polarity of connections among the neurons change, but also the structure of the network varies.
Therefore, the idea of structure variation is based on anatomy.

This paper is organized as follows: Since the method of network compression will be used
repeatedly in the later sections, it will be introduced in section 2, as well as the relevant proofs of its
generalization. Based on these results, the idea and technique of second learning and rotation
transformation are developed, the latter showing ways to escape from local minima by digging tunnels
horizontally into the error hypersurface. In section 3, some construction methods for both binary and
real training patterns, particularly for the parity problem and the encoder problem, are given. In the
method of construction for the encoder problem, only one hidden neuron is used and the values of
connection weights and thresholds are polynomially increasing, so that it is convenient to realize both
in programs and circuits. A statistical technique for fabricating multilayer perceptrons is also
discussed. In section 4, the principle and method for network expansion are presented, in which we
intrinsically dig tunnels down with an inclination into the error hypersurface, and by which one can
dig tunnels down into the error hypersurface from local minima to the global one. Section 5
concludes the paper.

In  practice,  the  methods  of  network  expansion,  network  construction  and  network  compression 
ought  to  be  used  alternately,  e.g.,  in  the  process  of  second  learning  and  rotation  transformation,  both 
expansion  and  compression  are  involved.  After  construction,  in  general,  network  compression 
should  be  considered.  That  is  to  say,  the  methods  of  structure  variation  ought  to  be  taken  as  a  whole 
rather  than  in  separate  parts,  and  it  will  be  powerful  and  efficient  if  we  combine  these  methods,  or 
even  with  gradient  descent  algorithms  in  training. 

2.  Method  of  Network  Compression 

In this section, firstly a method of pruning away the redundant hidden neurons is reviewed
systematically and mathematically. Secondly the necessary and sufficient condition for a change of the
generalization value during the pruning process is given, as well as some relevant theorems. Thirdly,
the steps of learning the patterns in the testing set, which is called the second learning, are addressed,
and the change of generalization ability during second learning is proved. Finally, the method of rotation
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transformation  is  discussed. 

2.1. Pruning away redundant hidden neurons

In recent years, some methods of pruning away the redundant connections were published. Some
papers used connection constraints during training and removed the small-valued connections after
training, e.g., Yasui (1992), Qin & He (1992). Using linear dependency, some researchers presented the
method of general compression (Arai, 1989; Sperduti & Starita, 1992; Liang & Xia, 1993). Although
this method seems trivial and obvious, it is the starting point of some useful methods.
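One way to read compression by linear dependency is the following: if the output of a hidden neuron over the training set is a linear combination of the other hidden outputs and a constant, the neuron can be removed and its outgoing weights folded into the remaining ones without changing the network outputs on the training set. The sketch below illustrates that reading only; the function name, argument layout and tolerance are assumptions and are not taken from the cited papers.

```python
import numpy as np


def prune_dependent_neuron(hidden_out, w_out, bias_out, j, tol=1e-8):
    """Try to remove hidden neuron j without changing outputs on the training set.

    hidden_out: (P, h) hidden-layer outputs for P training patterns
    w_out:      (h, m) hidden-to-output weights
    bias_out:   (m,)   output biases
    Returns (new_w_out, new_bias_out, kept_indices), or None if column j is
    not (numerically) a linear combination of the other columns and a constant.
    """
    P, h = hidden_out.shape
    keep = [i for i in range(h) if i != j]
    # Basis: remaining hidden outputs plus a constant column for the bias.
    basis = np.hstack([hidden_out[:, keep], np.ones((P, 1))])
    coef, *_ = np.linalg.lstsq(basis, hidden_out[:, j], rcond=None)
    residual = np.linalg.norm(basis @ coef - hidden_out[:, j])
    if residual > tol:
        return None
    # Fold neuron j's contribution into the remaining weights and the bias.
    new_w = w_out[keep, :] + np.outer(coef[:-1], w_out[j, :])
    new_b = bias_out + coef[-1] * w_out[j, :]
    return new_w, new_b, keep
```

The quantity that is preserved is the net input to the output layer, `hidden_out @ w_out + bias_out`, so the result does not depend on the output activation function.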

2.2.  Generalization  analysis  during  compressing 

While  pruning  away  the  redundant  neurons,  generalization  ability  may  vary.  In  this  subsection,  we 
discuss  in  which  case  it  varies  and  in  which  case  it  does  not.  Two  lemmas  and  six  theorems  are 
proved.  See  Liang  &  Xia  (1993)  and  Liang  (1993)  for  details. 

2.3.  Second  learning 

See  Liang  &  Xia  (1993)  for  details. 

2.4.  Rotation  transformation 

Provided that the network is not at the global minimum, adding a hidden neuron and pruning
away an old neuron in the same layer is the process of rotation transformation. In this process, the
output errors are not changed, so we dig tunnels horizontally into the error hypersurface. Since a
hidden neuron is added and then another is deleted, rotation transformation will not change the scale
of the network. See Liang (1993) for details.

3.  Methods  of  Network  Construction 

We think that construction is one part of the structure variation methods because the memorization of
training patterns is accomplished mainly by structure variation rather than by adjustment of weights and
thresholds. Construction procedures can also be found in the human brain.

Although Hecht-Nielsen (1987) gave existence proofs using the Kolmogorov Theorem (1957),
he later admitted (1990) that since the proof is not constructive, it is not useful in practice. Consequently,
many papers discussed construction methods (Arai, 1989 & 1993; Huang & Huang, 1990;
Kruglyak, 1990; Stork, 1992; Stork & Allen, 1992 & 1993; Brown, 1993; Korn, 1993; Liang et al,
1993; Liang & Xia, 1993a & 1993b; etc). For binary training patterns, Arai (1989 & 1993) and Liang
et al (1993) constructed their three-layer perceptrons. For real training patterns, Huang &
Huang (1990) and Liang & Xia (1993a) gave some different construction methods. Besides, some
researchers studied the construction for two special training sets: the parity problem and the encoder
problem. For the parity problem, Stork & Allen (1992) constructed a multilayer perceptron with a minimal
number of free parameters, and replied to Brown (1993) and Korn (1993) respectively in his letter to
the editor (1993). We shall present a new construction with fewer connections than before for the
parity problem. For the encoder problem, Kruglyak (1990) constructed a feedforward neural
network solving the N-bit encoder problem with just two hidden units. In a recent letter Stork & Allen
(1993) presented a constructive method for the N-bit encoder problem with just one hidden unit,
which gives the minimal network architecture for this problem. However, in their method, the input
weights and output thresholds increase exponentially as N increases, which makes the method
impractical for designing learning algorithms. In our improved method (Liang & Xia, 1993b), the
weights and thresholds increase polynomially as N increases, thus it may be more useful to
learning algorithm designers.

In this section, only our methods are introduced. We will first briefly introduce the general
analytic expression for binary training patterns, which is convenient to extend to some
problems, such as the parity problem. In subsection 3.2, constructions for real training patterns are
discussed. The parity problem and the encoder problem are treated in subsections 3.3 and 3.4
respectively. Subsection 3.5 gives a statistical fabricating technique.

3.1. Construction for binary training patterns
See  Liang  et  al  (1993). 

3.2.  Construction  for  real  training  patterns 
See  Liang  &  Xia  (1993a). 

3.3.  Construction  for  the  parity  problem 

Minsky & Papert (1969) gave a construction for the parity problem. Liang (1993) gave his
construction for the parity problem, whose construction process is just like cutting the N-cube
hierarchically from a vertex. Recently, Setiono & Hui (1993) announced that some N-bit parity problems
can be realized by fewer than N hidden neurons. We have tested and confirmed their result.
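For orientation, a well-known textbook threshold network for N-bit parity with N hidden units can be written down directly. It is shown only as a baseline against which constructions with fewer hidden neurons can be compared; it is not the construction of Liang (1993) or of Setiono & Hui (1993), and the weights below are an illustrative choice.

```python
from itertools import product


def step(z):
    """Unit step: 1 for z > 0, else 0."""
    return 1 if z > 0 else 0


def parity_net(bits):
    """N-bit parity with N hidden threshold units and one output threshold unit."""
    n = len(bits)
    s = sum(bits)                      # all input weights are 1
    # Hidden unit k (k = 1..n) fires when at least k inputs are on.
    hidden = [step(s - (k - 0.5)) for k in range(1, n + 1)]
    # Output weights alternate +1, -1, ...; the output threshold is 0.5.
    out = sum(((-1) ** k) * h for k, h in enumerate(hidden))
    return step(out - 0.5)


for bits in product([0, 1], repeat=3):
    print(bits, parity_net(bits))
```

With s inputs on, the first s hidden units fire and the alternating sum is 1 for odd s and 0 for even s, so the output equals the parity.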

Besides, interestingly, using the wave-like monotone increasing activation function, Stork &
Allen (1992) gave the minimal architecture with 2 hidden units for standard three-layer perceptrons
(referring to the later four conditions Stork & Allen stated (1992), although strictly speaking, the
network given by Stork & Allen (1992) also violates their third condition: the unit step function of
the output unit is not a strictly monotonically increasing function).

Inspired by Stork & Allen (1992) and Minor (1993), Fig.1 depicts a four-layer network with an equal
number of free parameters and with similar generalization as the network of Stork & Allen (1992),
but with fewer connections than those of Stork & Allen (1992) and Minor (1993). This network also
satisfies the three conditions that Stork (1993) repeated later and gives fewer connections than before
for standard multilayer perceptrons.

3.4. Construction for the encoder problem

See Liang & Xia (1993b).

3.5.  Statistical  fabricating  method 
See Liang & Xia (1993c).

4.  Methods  of  Network  Expansion 

In this section, compensating methods for multilayer perceptrons, which are very difficult to train by
traditional Back Propagation methods, are presented. For a three-layer perceptron trapped in local
minima the compensating methods can correct the wrong outputs one by one until all outputs are right,
so that the three-layer perceptron can escape from local minima to a global minimum. The compensation
methods use the principle of network expansion. A hidden neuron is added as compensation for a binary
input three-layer perceptron trapped in a local minimum; and one or two hidden neurons are added
as compensation for a real input three-layer perceptron. For a perceptron with more than three layers,
the second hidden layer from behind will be temporarily treated as the input layer during
compensation, hence the above methods can also be used. In compensating, whenever a hidden neuron is
added its input and output weights and threshold are calculated directly rather than iteratively, so the
global convergence is guaranteed and a lot of time is saved. If the global minimum on the error
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Fig.1. An N-1-2-1 feedforward network that
solves the N-parity problem. The values in the
units are the thresholds.
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hypersurface is in the "narrow and deep well", the multilayer perceptron can be moved there by
compensating. See Liang & Xia (1993d & 1993e) for details.

5.  Conclusions  and  Further  Work 

In this paper, the principle and methods of structure variation are presented and developed, which
consist of three parts: network expansion, network construction and network compression. Each
part comprises several methods and their applications, such as second learning and rotation
transformation. This idea also has an anatomical basis like the Back Propagation method. In practice,
it is appropriate to combine several structure variation methods and even combine them with
conventional gradient descent techniques.

Obviously  there  is  still  a  lot  of  work  to  do  in  the  field  of  structure  variation  methods. 
Consequently,  further  work  includes  developing  and  completing  the  methods  of  structure  variation, 
laying  a  more  solid  foundation  for  this  idea,  and  testing  these  combined  techniques  in  applications. 
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Abstract 

A  novel  learning  technique  is  described  as  a  faster  and  more  reliable  alternative  to  the  classical 
backpropagation  method.  The  approach  is  based  on  the  application  of  Least  Squares  criterion  to  a 
linearized  system  at  each  step  of  the  learning  procedure.  The  squared  error  at  the  output  of  each 
layer  immediately  before  the  non  linearity  is  minimized  over  the  entire  training  set  by  a  Block 
Recursive  Least  Squares  algorithm.  The  optimal  weights  (in  the  sense  of  minimal  2-norm  of  the 
error)  are  computed  for  each  layer  by  using  the  QR  decomposition. 

The high performance of the new algorithm has been verified in several experimental trials,
yielding considerable improvements in both accuracy and speed of convergence.


1-Introduction

The multilayer perceptron is one of the most commonly used types of feed-forward neural
networks and it is used in a large number of applications. Its strength resides in its capacity to
map arbitrarily complex non-linear functions by a convenient number of layers of sigmoidal
non-linearities. The backpropagation algorithm (BP) is still the most used learning algorithm; it
consists in the minimization of the Mean-Squared Error (MSE) at the network output, performed
by means of a gradient descent on the error surface in the space of weights.

The backpropagation algorithm suffers from a number of shortcomings, above all the relatively
slow rate of convergence and the final misadjustment, so that the success of the training
procedure cannot be guaranteed in real applications. Great efforts have been made to overcome these
limitations by introducing some heuristic modifications to the basic BP algorithm [1][2].
However, these methods require an accurate tuning of learning parameters in order to obtain
satisfactory performance.

Recently a new class of algorithms has been developed, based on Least Squares concepts [3]
applied to the solution of a linearized system for each layer of the network. These techniques
generally offer more reliable training procedures and much higher convergence rates [4][5][6].
The Block Recursive Least Squares (BRLS) training algorithm yields considerable
improvements in both numerical accuracy and speed of
convergence. Its numerical stability is enhanced by the use of QR decomposition [3] and by the
fact that the algorithm works directly on data, without forming any correlation matrix.
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2-Description of the algorithm


The presence of the non-linearity makes it difficult to apply to multilayer perceptrons a number
of techniques that are popular in the field of adaptive filtering [7]. A kind of linearization is needed in
order to make a large number of well-tested and efficient algorithms available to the problem of
learning.

The algorithm presented here is based on the idea of separating each layer of the network into a
linear part (the multiplication by the weights) and a non-linear one (the activation functions).
Defining the error immediately before the non-linearity allows the method of QR
Recursive Least Squares ([3] [7]) to be used to update the weights of each layer.
In the backpropagation algorithm the output error at step n is defined as:

$$E(n) = \sum_{p} E_p(n) \qquad (1)$$

where $E_p$ is the output squared error for the p-th pattern.

The weights are updated by computing the derivatives of E according to the formula:

$$\Delta w_{ij}^{(k)}(n) = -\eta\, \frac{\partial E(n)}{\partial w_{ij}^{(k)}(n)} \qquad (2)$$

where $w_{ij}^{(k)}$ is the weight from the i-th neuron in layer (k-1) to the j-th neuron in layer (k) and $\eta$
is the learning rate.

The learning rule thus derived is:

$$\Delta w_{ij}^{(k)}(n) = \eta \sum_{p} e_{pj}^{(k)}\, x_{pi}^{(k-1)} \qquad (3)$$

where $e_{pj}^{(k)}$ is the error signal for the j-th unit in layer (k) and $x_{pi}^{(k-1)}$ is the output of the i-th
unit in layer (k-1), relative to the p-th input pattern. The error signal is computed as:

$$e_{pj}^{(L)} = f'\big(y_{pj}^{(L)}\big)\big(d_{pj} - x_{pj}^{(L)}\big) \qquad (4)$$

for the output layer, and as:

$$e_{pj}^{(k)} = f'\big(y_{pj}^{(k)}\big) \sum_{i} e_{pi}^{(k+1)}\, w_{ji}^{(k+1)} \qquad (5)$$

for all the other layers. In these formulas $d_{pj}$ is the j-th desired target output for the p-th pattern,
$y_{pj}^{(k)}$ is the input to the generic non-linearity, and $f'(\cdot)$ is the derivative of the non-linear
activation function, typically the sigmoid. L is the number of layers.

The error signals defined above can be used to form a linear system for each layer of the
network; after each presentation of a learning epoch, this system can be solved in the
least-squares (LS) sense, yielding the optimal weights.


The new algorithm can be formulated in matrix notation in the following way. Let P be the
length of a generic epoch. For each layer the following matrices are defined:

$$X(n) = \begin{pmatrix} x_1^T(n) \\ \vdots \\ x_P^T(n) \end{pmatrix}; \quad
Y(n) = \begin{pmatrix} y_1^T(n) \\ \vdots \\ y_P^T(n) \end{pmatrix}; \quad
E(n) = \begin{pmatrix} e_1^T(n) \\ \vdots \\ e_P^T(n) \end{pmatrix} \qquad (6)$$

where the layer index has been omitted and n indicates the generic iteration. In these expressions
$x_i^T$, $y_i^T$ and $e_i^T$ are the input, output and error row vectors relative to the linear part
of the generic layer, for the i-th learning pattern (T indicates matrix transposition).
Moreover, Q(n) and R(n) denote the matrices resulting from the QR decomposition of
the system coefficient matrix [3].

The  BRLS  algorithm  consists  of  the  following  steps; 

1-  the  weights  are  randomly  initialized, 

2- the triangular matrix R is initialized to $R(0) = \mathrm{diag}(\varepsilon)$, where $\varepsilon$ is a
properly chosen small value;

3-  each  pattern  of  the  current  epoch  is  presented  to  the  network  and  forward  propagated 
through  it;  during  this  phase  the  matrices  X(n)  and  Y(n)  for  each  layer  are  formed; 

4-  for  each  pattern,  the  output  of  the  network  is  compared  to  the  desired  output;  the  error 
signals  (4)  and  (5)  at  the  output  of  the  linear  part  of  each  layer  are  computed.  For  each  layer  the 
perturbation  matrix  E(n)  is  formed ; 

5- after the presentation of an entire training epoch, for each layer the following linear system is
formed:

$$\begin{pmatrix} \lambda^{1/2} R(n-1) \\ (1-\lambda)^{1/2} X(n) \end{pmatrix} W(n) =
\begin{pmatrix} \lambda^{1/2} C(n) \\ (1-\lambda)^{1/2} \big(Y(n) + \eta E(n)\big) \end{pmatrix} \qquad (7)$$

where $\eta$ is the learning rate (measuring the size of the perturbation on matrix Y) and $\lambda$ is
the forgetting factor. This system is solved for n>0 by first performing a QR decomposition of the
coefficient matrix, yielding the matrices Q(n) and R(n). In (7) C(1)=0, while for n>1 C(n) is
computed from the formula:

$$Q^T(n-1) \begin{pmatrix} \lambda^{1/2} C(n-1) \\ (1-\lambda)^{1/2} \big(Y(n-1) + \eta E(n-1)\big) \end{pmatrix} =
\begin{pmatrix} C(n) \\ \ast \end{pmatrix} \qquad (8)$$

Then a procedure of backsubstitution on matrix R(n) yields the optimal set of weights, in the
sense of the minimal 2-norm of the weight solution matrix W(n);

6- if the global output error with the new weights is still higher than a specified threshold, the
procedure is repeated from point 3 by appending a new epoch (i.e. the QR decomposition is
performed recursively as new data arrive); otherwise the training has terminated successfully.
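
Step 5 can be sketched for a single layer as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `brls_layer_update`, its argument names, and the use of `numpy.linalg.qr` followed by a general linear solve (instead of a dedicated backsubstitution routine) are illustrative choices.

```python
import numpy as np

def brls_layer_update(R_prev, C_prev, X, Y, E, eta, lam):
    """One QR-based least-squares update for the linear part of one layer.

    R_prev : (d, d) triangular factor carried over from the previous epoch
    C_prev : (d, m) rotated right-hand side carried over (zeros at the first epoch)
    X      : (P, d) inputs to the linear part, one row per pattern
    Y      : (P, m) outputs of the linear part
    E      : (P, m) error signals at the output of the linear part
    eta    : learning rate scaling the perturbation of Y
    lam    : forgetting factor, 0 < lam < 1
    """
    # Stack the weighted old information on top of the new epoch, as in (7).
    A = np.vstack([np.sqrt(lam) * R_prev, np.sqrt(1.0 - lam) * X])
    B = np.vstack([np.sqrt(lam) * C_prev, np.sqrt(1.0 - lam) * (Y + eta * E)])
    # Reduced QR: Q has shape (d + P, d), R is (d, d) upper triangular.
    Q, R = np.linalg.qr(A)
    # Rotated right-hand side; it is reused at the next epoch, as in (8).
    C = Q.T @ B
    # Solve the triangular system R W = C (general solver used for brevity).
    W = np.linalg.solve(R, C)
    return W, R, C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m, P = 3, 2, 8                      # hypothetical layer and epoch sizes
    R, C = 1e-3 * np.eye(d), np.zeros((d, m))
    for epoch in range(3):                 # random data stands in for real signals
        X = rng.normal(size=(P, d))
        Y = rng.normal(size=(P, m))
        E = rng.normal(size=(P, m))
        W, R, C = brls_layer_update(R, C, X, Y, E, eta=0.1, lam=0.9)
    print(W.shape)
```

Only the triangular factor and the rotated right-hand side are carried between epochs, so no correlation matrix is ever formed.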

The QR decomposition (performed with either the Householder transformation or Givens
rotations) gives the algorithm numerical stability and robustness. In
some cases it can be replaced by a Singular Value Decomposition (SVD), yielding complete
control over the internal structure of matrix X and the regularity of the weight matrix W, at the
expense of a higher computational cost.


3-Experimental results

The performance of the BRLS algorithm has been evaluated on several problems: parity (2, 3
and 4 bits), generalized XOR, and pattern recognition (circle in a square and character recognition).
In all cases the comparison with backpropagation was made over a suitable number of
trials with different configurations of initial weights and different values of the learning parameters.
The main results of this analysis are much faster convergence and higher accuracy of the
new algorithm in approximating the desired outputs.

Fig. 1 reports the MSE as a function of the number of iterations in a typical case for the XOR
problem; both the rapidity of convergence (about 30 iterations to get MSE < 0.01) and its depth
(MSE ≈ 10^-4 after 100 iterations) can be verified.


The algorithm has also shown the ability to form sharper transition regions. This property is
shown in Fig. 2, which refers to the circle-in-a-square problem.


Fig. 2: 3D output for the circle-in-a-square problem

The possibility of varying the length P of the training epoch (unlike other LS-based
algorithms) is a distinctive feature of the algorithm. Its efficacy is shown by the good
performance of the algorithm in problems where the training patterns are selected completely at
random during the learning phase. Moreover, with respect to previous approaches ([4] [5] [6])
the numerical stability is enhanced by the fact that the proposed procedure works only on raw
data matrices, without forming any correlation matrix.
Abstract

A concurrent implementation of the method of conjugate gradients for training Elman networks
is discussed. The parallelism is obtained in the computation of the error gradient and the method is
therefore applicable to any gradient descent training technique for this form of network. The experimental
results were obtained on a Sun Sparc Center 2000 multiprocessor. The Sparc 2000 is a shared memory
machine well suited to coarse-grained distributed computations, but the concurrency could be extended
to other architectures as well.


1  Introduction 

It  takes  an  exceptionally  large  amount  of  computer  time  to  train  recurrent  networks  because  of  the  added 
complexity  of  the  derivative  calculations.  In  this  work,  we  focus  on  one  type  of  recurrent  network,  Elman’s 
Simple  Recurrent  Network  [1],  and  we  present  a  way  to  distribute  the  gradient  computation. 

Figure  1  shows  a  variant  of  an  Elman  SRN.  This  is  a  partially  recurrent  neural  network  capable  of  learning 
sequence  information.  The  context  units  hold  copies  of  the  hidden  unit  activations  from  the  previous  pattern 
presentation,  and  therefore  the  output  of  the  network  can  depend  not  only  on  the  current  input  but  also  on 
the  entire  input  history. 

Our network architecture includes "skip connections" that bypass the hidden layer. It has been
determined experimentally that these connections allow for faster network training. They provide an
alternate set of parameters for the linearly separable, or perceptron, portion of the problem. See [2]
for a more complete discussion of the rationale for these connections.

Each  input  sequence  is  an  ordered  set  of  patterns  because  of  the  recurrent  connections  in  the  network. 
These  allow  the  network  to  learn  sequence  information  and  base  its  output  on  the  history  of  the  inputs 
presented to it. At the beginning of a sequence, we can set the feedback activations to zero, so that they
have no impact on the output during the first pattern presentation. We have found empirically that this
is the best choice for the initial conditions. See [2] for a more detailed discussion. During subsequent
presentations,  the  feedback  units  are  copied  back  from  the  hidden  layer  and  provide  the  context  needed 

*This material is based upon work supported by the National Science Foundation under Grant No. IRI-9201987. Thanks to
Dr. Mark Franklin and the Washington University Computer and Communications Research Center for the use of their Sparc
Center 2000 multiprocessor. The Sparc Center 2000 was purchased in part with funds from NSF CISE Instrumentation Grant
9022560.
Figure 1: An Elman Simple Recurrent Network.


to make decisions about later input patterns. Typically, we will present the network with many of these
sequences  during  training,  in  hopes  that  its  performance  will  generalize  across  the  set  of  all  possible  sequences 
in  some  reasonable  way. 

Some notational conventions:

$t_{po}$   The target of output unit o when the network is presented with pattern p.
$o_{po}$   The activation of unit o when the network is presented with pattern p.
$b_i$      The bias value of unit i.
$w_{ij}$   The weight from unit i to unit j.
$T$        The set of all feedback (context) units.
$O$        The set of all output units.
$V$        The set of input sequences.


We are using an epoch-based training method, so the error function is a sum over all patterns of some
function of the targets and the actual activations, defined as:

$$\Phi = \sum_{p \in V} \sum_{o \in O} e(t_{po}, o_{po}).$$

Any gradient-descent based method of minimizing the error function will need to calculate the gradient.
That is, we need to take a derivative of our error function with respect to each of the parameters of the
network. Taking $\gamma$ to be some weight or bias in the network, we have

$$\frac{\partial \Phi}{\partial \gamma} = \sum_{p \in V} \sum_{o \in O} \frac{\partial e(t_{po}, o_{po})}{\partial \gamma} \qquad (1)$$


For  efficiency,  we  should  obviously  propagate  each  pattern  through  to  the  outputs  and  then  calculate  the 
contribution  from  this  pattern  for  every  parameter  in  the  network,  summing  each  term  into  a  global  sum 
for that parameter. The propagation steps also must be performed in a certain order due to the sequence
information  inherent  in  the  patterns.  However,  we  can  break  up  the  patterns  along  sequence  boundaries,  and 
present  the  sequences  themselves  in  an  arbitrary  order.  This  is  where  opportunities  for  concurrency  arise. 


2  Concurrency 

We can partition the set of patterns into subsets where each subset is itself ordered, but requires no context
from any other subset. That is,

$$V = \{s_1, s_2, \ldots, s_N\},$$

where we can define $len(s_i)$ to be the number of patterns in sequence i. The input sequences might be of
different lengths, and so we need a strategy for assigning sequences to processors so that the computation is
as load-balanced as possible. To accomplish this, we first sort the sequences in nonincreasing order by $len(s_i)$,
and then assign each sequence in turn to the least loaded processor. We assume that the load is proportional
to the number of pattern presentations required and therefore to the total length of all sequences assigned
to a processor so far.
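
The assignment strategy just described can be sketched as follows. This is an illustrative sketch only: the function name `assign_sequences`, the use of a heap to find the least loaded processor, and the example lengths are assumptions, not details taken from the trainer.

```python
import heapq

def assign_sequences(lengths, num_processors):
    """Greedy assignment: longest sequence first, always to the least loaded processor.

    lengths        : list with the number of patterns in each sequence
    num_processors : number of processors available
    Returns (assignment, loads): assignment[k] lists the sequence indices given
    to processor k, and loads[k] is the total number of patterns assigned to it.
    """
    # Visit sequences in nonincreasing order of length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # Min-heap of (current load, processor index).
    heap = [(0, k) for k in range(num_processors)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]
    for i in order:
        load, k = heapq.heappop(heap)          # least loaded processor so far
        assignment[k].append(i)
        heapq.heappush(heap, (load + lengths[i], k))
    loads = [sum(lengths[i] for i in seqs) for seqs in assignment]
    return assignment, loads

if __name__ == "__main__":
    example_lengths = [500, 120, 480, 300, 310, 90, 450]  # hypothetical lengths
    assignment, loads = assign_sequences(example_lengths, 3)
    print(assignment, loads)
```

Sorting first places the long sequences while all processors are still lightly loaded, leaving the short ones to even out the remaining imbalance.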

After sequences have been assigned to processors, we need to make some additional modifications to
the sequential code. Since each processor will be doing independent forward propagation, we will need a
separate copy of the network activations for each job. Since each processor will be computing a local sum
of the gradient components, we will need a separate copy of all the variables for each job. However, the
results are to be computed using only one set of weights, and so all of the $w_{ij}$ and $b_i$ values can be shared.
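
The data layout described above (shared read-only weights, private accumulators per job, one final reduction) can be sketched as follows. This is a hedged illustration, not the authors' C code: the names `local_gradient_sum` and `parallel_gradient` are hypothetical, and the per-pattern gradient is a stand-in (a single linear unit with squared error) rather than the recurrent-network derivative. Python threads are used only to show the structure; they are not expected to reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def local_gradient_sum(weights, sequences):
    """Sum gradient contributions over the sequences assigned to one job.

    `weights` is shared and only read; `local_sum` is private to the job.
    """
    local_sum = np.zeros_like(weights)
    for seq in sequences:          # sequences may be processed in any order
        for x, t in seq:           # patterns inside a sequence stay ordered
            # Stand-in gradient of 0.5 * (w.x - t)^2 with respect to w.
            local_sum += (weights @ x - t) * x
    return local_sum

def parallel_gradient(weights, jobs):
    """Run one job per worker, then add the local sums into a grand sum."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        partial_sums = list(
            pool.map(lambda seqs: local_gradient_sum(weights, seqs), jobs)
        )
    return sum(partial_sums, np.zeros_like(weights))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=4)
    # Four hypothetical sequences of different lengths, split over two jobs.
    seqs = [[(rng.normal(size=4), float(rng.normal())) for _ in range(n)]
            for n in (3, 5, 2, 4)]
    jobs = [[seqs[1]], [seqs[3], seqs[0], seqs[2]]]
    print(parallel_gradient(w, jobs))
```

Because addition is the only interaction between jobs, the grand sum is the same (up to rounding) whichever way the sequences are partitioned.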

The  derivative  calculation  for  recurrent  networks  is  quite  computationally  intense.  It  scales  as  0(||J'||^) 
for  calculating  the  derivatives  with  respect  to  weights  that  connect  input  units  to  hidden  units.  See  [2]  for 
the  details  of  these  calculations. 

Note that certain implementations of second order methods may require a line search along the descent
direction indicated by the gradient in order to find a minimum in that direction. Our conjugate gradient
trainer uses such a search, and we have found that a derivative-free line search involving only evaluations of
$\Phi$ is the most efficient. The above partitioning of input patterns can be used equally well to speed up this
forward propagation.


3  Results 

Our conjugate-gradient trainer is implemented using the available C libraries for multi-threaded execution
on the Sun Sparc Center 2000 multiprocessor system. There are currently eight processors available on the
system, but with a coming operating system upgrade, this number should increase to twenty. The Sparc 2000
has a shared memory architecture with two high bandwidth packet buses. A message-passing implementation
would require duplication and update of the $w_{ij}$ and $b_i$ values in the local memory of each node.

Our test problem was a (192+9)-13-2 network, meaning 192 inputs, 9 feedback units, 13 hidden units,
and 2 outputs. The goal was to identify the language spoken in a ten-second sample of audio. The network
was presented with successive 400 millisecond overlapping frames of bandpass filtered sound and trained to
differentiate between English and French speakers. The training patterns consisted of 41 sequences which
were divided over processors as evenly as possible. Table 1 shows the division of labor. There was a total of
15,457 patterns in all 41 sequences.
            Number of Sequences Assigned

Processor |   Number of Processors Used
          |   1    2    3    4    5    6    7    8
----------+---------------------------------------
    8     |                                      6
    7     |                                 6    5
    6     |                            7    6    5
    5     |                       9    7    6    5
    4     |                 11    8    7    6    5
    3     |            14   10    8    7    6    5
    2     |       21   14   10    8    7    6    5
    1     |  41   20   13   10    8    6    5    5

Table 1: Division of Labor


Figure 2: Speedup of a derivative, or backpropagation, epoch as a function of processors used.

Figure  2  shows  the  speedups  obtained  for  a  derivative  epoch.  Speedup  was  experimentally  measured 
and  is  the  ratio  between  the  execution  time  of  the  sequential  version  and  the  execution  times  of  each 
multiprocessor  version.  It  includes  all  overhead  for  adding  each  portion  of  the  derivatives  into  a  grand 
sum  for  each  network  parameter.  A  derivative  evaluation,  in  the  sequential  version,  takes  approximately  24 
minutes  for  this  problem. 

Figure 3 shows the speedups obtained for the derivative-free forward evaluation of the error. A forward
evaluation, in the sequential version, takes approximately 13 seconds for this problem. Typically, about
ten forward evaluations are required per conjugate gradient epoch to perform the line search. The speedups
obtained here are smaller because the computation is smaller, and the overhead of concurrent memory access
tends to drown out the advantages gained. The data point for 6 processors represents an unusually efficient
use of time. This is probably because the assignment of sequences, as shown in Table 1, is unusually smooth,
and the smaller computation is more sensitive to this than the derivative calculation.
Figure 3: Speedup of a line search epoch as a function of processors used.

Figure 4 shows the overall speedups obtained for a conjugate gradient epoch. The speedups take into
account time from the sequential portion of the code that computes the conjugate gradient directions. A
training epoch, in the sequential version, takes approximately 27 minutes for this problem. This time is
dominated by the derivative calculation, and so good overall speedup can be obtained by making that
portion of the code concurrent. Note, however, that forward evaluation is also important, as indicated by
the data point for 6 processors.
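
The way the component timings combine into an overall figure can be illustrated with a short calculation. The default times below are the approximate sequential figures quoted in the text (24 minutes per derivative evaluation, 13 seconds per forward evaluation, about ten forward evaluations and 27 minutes per epoch); the function name `overall_speedup` and the component speedups passed to it are hypothetical, since the measured values appear only in the figures.

```python
def overall_speedup(deriv_speedup, forward_speedup,
                    deriv_s=24 * 60, forward_s=13, n_forward=10, epoch_s=27 * 60):
    """Estimate the speedup of a whole conjugate gradient epoch.

    deriv_speedup   : speedup of the derivative evaluation (hypothetical input)
    forward_speedup : speedup of one forward evaluation (hypothetical input)
    The remaining time of the epoch is treated as purely sequential.
    """
    # Time not covered by the derivative or the line-search evaluations.
    serial_s = epoch_s - deriv_s - n_forward * forward_s
    parallel_epoch_s = (deriv_s / deriv_speedup
                        + n_forward * forward_s / forward_speedup
                        + serial_s)
    return epoch_s / parallel_epoch_s

if __name__ == "__main__":
    # Hypothetical component speedups, for illustration only.
    for d_sp, f_sp in [(2.0, 1.5), (4.0, 2.5), (6.0, 3.0)]:
        print(d_sp, f_sp, round(overall_speedup(d_sp, f_sp), 2))
```

Since the derivative dominates the epoch, its speedup largely sets the overall figure, while the forward evaluations and the sequential remainder cap what can be reached.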

Typically, a single training run will require hundreds of epochs. The overall speedup presented here
therefore  represents  significant  savings  in  time  over  the  sequential  version.  By  reducing  the  turnaround 
time,  a  greater  number  of  network  architectures  can  be  investigated,  and  connectionist  research  can  be  more 
effective. 

4  Conclusion 

Our  previous  work  focused  on  the  possibilities  for  concurrency  in  one  portion  of  the  derivative  calculation. 
This  allowed  us  to  achieve  some  speedup,  but  it  was  limited  by  the  size  of  the  network  architecture  used, 
and  did  not  allow  concurrent  computation  of  the  forward  propagation  step.  Our  current  trainer,  although  it 
was  more  difficult  to  implement,  allows  us  to  partition  the  set  of  inputs,  which  is  typically  large.  This  lets 
us  use  the  available  hardware  more  efficiently. 

While the training of recurrent networks, even of simple ones, introduces myriad new complexities over
feedforward network training, our algorithm contains opportunities for concurrency. These opportunities
can  be  taken  advantage  of  after  a  careful  and  thorough  study  of  the  data  dependencies  involved.  Reducing 
the  real  time  elapsed  during  a  training  run  is  of  great  benefit  to  those  undertaking  connectionist  research 
projects.  It  means  that  more  experiments  can  be  conducted  in  less  time  than  with  sequential  methods. 
Figure 4: Overall speedup as a function of processors used.
Learning the sigmoid slopes to increase the convergence
Abstract: A time constant is introduced as a variable in the decision function of
the perceptron neuron. It is shown that a carefully chosen time constant
dramatically improves the performance of the backpropagation algorithm on a
benchmark. A new learning algorithm including the time constant as a variable is
developed in this study. Two versions of the algorithm are detailed. The
difference between them lies in the set of neurons to which the new algorithm is
applied. Both versions exhibit an improved convergence rate when compared to a
backpropagation algorithm using an optimum fixed time constant.


I  INTRODUCTION 


One of the most popular decision functions for the neurons of a multilayer perceptron (MLP) is
the so-called sigmoid function defined by:

$$y_j = \frac{1.0}{1.0 + e^{-net_j}}, \qquad net_j = \sum_{i=1}^{n} w_{ji}\, y_i + \theta_j$$

where:

- $w_{ji}$ is the synaptic weight from neuron i of layer l to neuron j of layer l+1,

- $\theta_j$ is a bias term,

- $y_i$ is the output of neuron i,

- n is the number of neurons of layer l.

Using this decision function, the poor performance of the MLP on the XOR problem has been a
strong motivation for research into improved versions [1][2] of the classical backpropagation (BP)
[3]. A simple variation of the decision function allows the MLP to achieve convergence in an
average of 42 epochs over 100 runs of the XOR problem. A time constant $\lambda$ scales the network
input to the neuron (i.e. $net_j$) before it passes through the non-linear decision function:

$$y_j = \frac{1.0}{1.0 + e^{-\lambda\, net_j}}$$

A "trial and error" method was used to choose a proper time constant. This approach is time
consuming and would be even more so should each neuron have its own time constant. In
section II, the time constant is introduced and its effects on the discrimination capabilities of a
neuron are analyzed. In section III, a new algorithm based on BP is derived in full. Each
neuron uses an individual time constant learned during training. In section IV, the dynamic of the
time constants is studied and we conclude that the output layer neurons alone should
make use of an individual time constant, thus reducing the computational burden induced by the
introduction of this variable in BP. Finally, a comparison of the new learning algorithms with a
standard version of the MLP is carried out on 2-D artificial data in section V. Section VI
concludes this paper by discussing the robustness of the algorithms.
II Time constant and discrimination

When used in classification applications, the sigmoid function has the advantage over the
symmetric $\tanh(net/2)$ function that it can be interpreted as a loose membership function varying
from 0 to 1. The criticism that the sigmoid function converges more slowly emphasizes the need for
fast learning algorithms. Consider a single perceptron neuron trained on a two-class discrimination
problem according to an algorithm aimed at minimizing an error function defined by:

$$E(t) = \frac{1}{2}\,\big(d(t) - y(t)\big)^2$$

where d(t) is the desired output at time t and y(t) is the actual output of the neuron at time t.

As learning unfolds, this neuron must deal with two opposing goals: on the one hand, the update
of its weights is largest when the output of the neuron is near 0.5; on the other hand, its
target value for each pattern is either 0 or 1, and those values prevent any update of the weights.
Adding a time constant to the decision function of the neuron helps to act on its sensitivity to input
patterns. The variation of the output value according to different values of the time constant is
illustrated in Fig. 1. A decision function with a large constant tends to approximate a Boolean
decision function, while the same function with a small constant looks almost linear.
Clearly the time constant must be carefully chosen in order to achieve fast and efficient training.
One may also wonder whether a fixed time constant is well suited for learning, since during the
process of training the behavior of the network is bound to change.
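
The effect of the time constant on the decision function can be checked numerically. The following is a small sketch assuming the form y = 1/(1 + exp(-λ net)) used in this paper; the function names `sigmoid` and `output_sensitivity` and the sample values of the time constant are illustrative.

```python
import numpy as np

def sigmoid(net, lam=1.0):
    """Decision function with time constant lam applied to the net input."""
    return 1.0 / (1.0 + np.exp(-lam * net))

def output_sensitivity(net, lam=1.0):
    """dy/dnet = lam * y * (1 - y): largest at y = 0.5, vanishing as y -> 0 or 1."""
    y = sigmoid(net, lam)
    return lam * y * (1.0 - y)

if __name__ == "__main__":
    net = np.linspace(-2.0, 2.0, 5)
    for lam in (0.2, 1.0, 10.0):   # small, unit and large time constants
        print("lam =", lam)
        print("  y      =", np.round(sigmoid(net, lam), 3))
        print("  dy/dnet=", np.round(output_sensitivity(net, lam), 3))
```

With a small constant the outputs stay close to 0.5 and vary almost linearly with the net input; with a large one they saturate near 0 or 1 and the sensitivity away from net = 0 collapses.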


Fig. 1: Influence of the time constant on the shape of the decision function.

The XOR problem has been referred to as a toy problem; the decision to introduce a time constant
must not be taken on the sole basis of the performance of a neural network with two hidden
neurons and one output neuron. When dealing with a real problem, such as texture classification,
the size of the network is an order of magnitude larger. No a priori
knowledge ensures the existence of a single optimum time constant shared by all the neurons; it
may turn out that each neuron needs a specific time constant. An empirical search for these optimum
values may soon become intractable. This emphasizes the need for learned time constants.


III Formal derivation of the new algorithm

Since  the  time  constants  will  be  learned,  and  according  to  their  effects  on  the  decision  function 
curve,  they  will  be  referred  to  as  slopes  from  here  on.  The  notations  are  defined  in  Fig.  2. 

The output of any neuron is defined by:

$$y_j = \frac{1.0}{1.0 + \exp(-\lambda_j\, net_j)}$$

The error function for pattern p is defined by:

$$E_p = \frac{1}{2} \sum_k (d_k - y_k)^2$$

where $d_k$ is the desired output for neuron k and $y_k$ is the actual one when the network is presented
with pattern p. The weight updating rule is computed according to:

$$\Delta w_{ji} = -\eta\, \frac{\partial E_p}{\partial w_{ji}}$$

In order to preserve the consistency of the BP learning algorithm, slope learning is achieved
according to:

$$\Delta \lambda_j = -\nu\, \frac{\partial E_p}{\partial \lambda_j}$$

For the neurons of the output layer this gives:

$$\frac{\partial E_p}{\partial w_{kj}} = -(d_k - y_k)\, \lambda_k\, y_k (1 - y_k)\, y_j \qquad (1)$$

$$\frac{\partial E_p}{\partial \lambda_k} = -(d_k - y_k)\, net_k\, y_k (1 - y_k) \qquad (2)$$

For the neurons of any hidden layer this gives:

$$\frac{\partial E_p}{\partial w_{ji}} = -\sum_k (d_k - y_k)\, y_k (1 - y_k)\, w_{kj}\, \lambda_k\, y_j (1 - y_j)\, \lambda_j\, y_i$$

$$\frac{\partial E_p}{\partial \lambda_j} = -\sum_k (d_k - y_k)\, y_k (1 - y_k)\, w_{kj}\, \lambda_k\, y_j (1 - y_j)\, net_j$$


A detailed formal derivation of the learning algorithm is given in the appendix (for the sake of
simplicity the momentum term is not introduced here).


Fig.  2 :  Notations  used  in  the  derivation  of  the  new  algorithm. 


Some learning algorithms need to evaluate the exact energy function before any weight update: the
entire training set must be propagated forward, and the error for each pattern estimated, between
successive backward propagations. Since the introduction of a time constant does not alter the core
of BP, the new algorithm takes advantage of the so-called stochastic gradient algorithm. After forward
propagation of a pattern, the weights and slopes are updated.
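
The per-pattern update with learned slopes can be sketched for a network with one hidden layer as follows. This is a minimal NumPy sketch of the gradient expressions of this section, without momentum; the dictionary layout, the names `forward`, `loss`, `gradients` and `sgd_step`, the learning rates and the inclusion of bias updates are assumptions made for illustration.

```python
import numpy as np

def forward(x, p):
    """Forward pass; p holds W1, b1, lam1 (hidden) and W2, b2, lam2 (output)."""
    net_j = p["W1"] @ x + p["b1"]
    y_j = 1.0 / (1.0 + np.exp(-p["lam1"] * net_j))
    net_k = p["W2"] @ y_j + p["b2"]
    y_k = 1.0 / (1.0 + np.exp(-p["lam2"] * net_k))
    return net_j, y_j, net_k, y_k

def loss(x, d, p):
    """Error for one pattern: 0.5 * sum_k (d_k - y_k)^2."""
    return 0.5 * np.sum((d - forward(x, p)[3]) ** 2)

def gradients(x, d, p):
    """Derivatives of the pattern error with respect to weights, biases and slopes."""
    net_j, y_j, net_k, y_k = forward(x, p)
    g_k = -(d - y_k) * y_k * (1.0 - y_k)        # part shared by weight and slope laws
    d_net_k = g_k * p["lam2"]                   # dE/dnet_k
    back_j = (p["W2"].T @ d_net_k) * y_j * (1.0 - y_j)
    d_net_j = back_j * p["lam1"]                # dE/dnet_j
    return {
        "W2": np.outer(d_net_k, y_j), "b2": d_net_k, "lam2": g_k * net_k,
        "W1": np.outer(d_net_j, x), "b1": d_net_j, "lam1": back_j * net_j,
    }

def sgd_step(x, d, p, eta=0.5, nu=0.1):
    """Update weights (rate eta) and slopes (rate nu) after one pattern."""
    g = gradients(x, d, p)
    for name in p:
        rate = nu if name.startswith("lam") else eta
        p[name] = p[name] - rate * g[name]
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = {"W1": rng.normal(size=(2, 2)), "b1": np.zeros(2), "lam1": np.ones(2),
         "W2": rng.normal(size=(1, 2)), "b2": np.zeros(1), "lam2": np.ones(1)}
    patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
    for epoch in range(200):
        for x, d in patterns:
            p = sgd_step(np.array(x, float), np.array(d, float), p)
    print(p["lam1"], p["lam2"])
```

The weight and slope laws share the factor `g_k` (and `back_j` for the hidden layer), so learning the slopes adds only one extra product per neuron.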




IV  Analysis  of  the  slope  dynamic 


A  test  on  artificial  data  is  used  to  analyze  the  dynamic  of  the  slopes.  Several  runs  were 
performed  on  two-dimensional  difficult  data.  One  representative  run  is  depicted  in  Fig.  3. 
Although  the  neural  network  used  on  this  test  is  a  two-layer  one,  preliminary  results  on  networks 
with  many  hidden  layers  confirm  the  analysis  drawn  below.  Furthermore,  some  additional  runs 
with  random  initial  values  of  the  time  constants  show  that  these  values  do  not  have  much  effect  on 
the  following  evolution. 

An empirical study shows that the slopes of the hidden-layer neurons do not vary much during
learning. The only variation is a slow increase of these slopes. This can be explained by
considering the presumed goal of the neurons of any hidden layer. These neurons try to separate
patterns according to their class: a steeper slope gives a finer separation.

The  slopes  of  the  last-layer  neurons  decrease  at  the  beginning  of  learning  then  increase  until 
convergence  is  achieved.  The  initial  decrease  of  these  slopes  can  be  analyzed  as  a  search  for 
information  in  order  to  ignite  the  process  of  decision-surface  positioning.  Once  those  decision 
surfaces  are  coarsely  positioned,  stabilization  becomes  the  goal  of  the  output  neurons.  The 
increasing  slopes  prevent  broad  variation  of  the  weights  by  saturating  the  output  of  the  neurons. 
Due  to  the  backward  process  of  updating,  small  variations  of  the  weights  of  the  last  layer  neurons 
prevent  large  variations  of  the  weights  of  the  hidden  layers  neurons. 


Fig. 3: Dynamic of the slopes (panels: slopes of the last-layer neurons; slopes of the first
hidden-layer neurons)


V  OPTIMUM  VERSION  OF  THE  NEW  ALGORITHM 

The new algorithm, as defined in section III, adds a computational burden to the already slow
and heavy BP algorithm. Although the updating laws of the weights and of the slopes share a
common part, it would be useful to keep the additional computation to a minimum. The analysis of
the dynamic of the slopes suggests a way to achieve this goal. Since the slopes of the hidden-layer
neurons do not vary much, those neurons may not need an individual variable slope. A minimum
algorithm including slope learning only for the neurons of the last layer is derived. The updating laws
for those neurons are identical to the previous ones, (1) and (2). In order to preserve adaptation
ability, a common slope $\gamma$ that can be set to any arbitrary value is kept in the decision function of
the hidden-layer neurons. The updating law for their weights is:

$$\frac{\partial E_p}{\partial w_{ji}} = -\sum_k (d_k - y_k)\, y_k (1 - y_k)\, w_{kj}\, \lambda_k\, y_j (1 - y_j)\, \gamma\, y_i$$
The performance of the minimum algorithm is analyzed on the problem used previously in
order to assess the loss induced by the simplification. A general robustness test was performed.
A standard MLP with a common slope for every neuron was trained successively with classical BP
(CBP), BP with individual slopes for every neuron (EBP) and BP with individual slopes for the
output layer neurons (OBP); each configuration was trained 10 times for each of 2300 combinations
of the learning parameters, chosen to span the whole parameter space. Results are described in the
next section.


VI  Conclusion  and  Discussion 

A new algorithm for the MLP has been presented. The introduction of a variable slope in the
decision function of the neurons increases the convergence speed. The additional computational
burden induced by the new algorithm led us to an empirical study of the slope dynamic, which
shows that a variable slope is of little use for the neurons of hidden layers. A minimum
algorithm was then derived. It exhibits good performance on both artificial and real data
classification problems. A detailed performance analysis can be found in [4]. Preliminary studies
of the robustness of the algorithms show that while the highest proportion of convergent runs
is obtained by EBP: 42.1% (pointing out its robustness to parameter variation), the performance
of OBP: 39.6% is nearly equivalent and in any case much better than that of CBP: 26.5%.


VII  APPENDIX 

Notations are defined in Fig. 2. The derivation deals with a two-layer network for the sake of
simplicity (the extension to networks with more layers is immediate). The output of a last-layer
neuron is:

$$y_k = \frac{1.0}{1.0 + \exp(-\lambda_k\, net_k)} \quad \text{with} \quad net_k = \sum_{j=1}^{m} w_{kj}\, y_j + \theta_k$$

The output of a hidden-layer neuron is given by:

$$y_j = \frac{1.0}{1.0 + \exp(-\lambda_j\, net_j)} \quad \text{with} \quad net_j = \sum_{i=1}^{n} y_i\, w_{ji} + \theta_j$$

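The variables (weights and slope) of a last-layer neuron are updated as follows:

$$\Delta w_{kj}(t+1) = -\eta\, \frac{\partial E_p}{\partial w_{kj}} + \alpha\, \Delta w_{kj}(t)$$

$$\Delta \lambda_k(t+1) = -\nu\, \frac{\partial E_p}{\partial \lambda_k} + \kappa\, \Delta \lambda_k(t)$$

where:

$$\frac{\partial E_p}{\partial w_{kj}} = \frac{\partial E_p}{\partial net_k}\, \frac{\partial net_k}{\partial w_{kj}} = \frac{\partial E_p}{\partial net_k}\, y_j$$

and:

$$\frac{\partial E_p}{\partial net_k} = \frac{\partial E_p}{\partial y_k}\, \frac{\partial y_k}{\partial net_k} = -(d_k - y_k)\, y_k (1 - y_k)\, \lambda_k$$

$$\frac{\partial E_p}{\partial \lambda_k} = \frac{\partial E_p}{\partial y_k}\, \frac{\partial y_k}{\partial \lambda_k} = -(d_k - y_k)\, \big(y_k (1 - y_k)\, net_k\big)$$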

Updating  of  a  hidden  layer  neuron  variable  follows : 

Awji  (t+1)  =  - 11  -P-  +  a  Awji  (t) 

a  Wji 

AXj  (t+1)  =  -  V  — — P-  +  K  AXj  (t) 

a  Xj 

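with:

$$ \frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_p}{\partial net_j} \, y_i $$

$$ \frac{\partial E_p}{\partial net_j} = \frac{\partial E_p}{\partial y_j} \, \frac{\partial y_j}{\partial net_j} = -\sum_k \delta_k \, w_{kj} \cdot y_j (1 - y_j) \, \lambda_j $$

and

$$ \frac{\partial E_p}{\partial \lambda_j} = \frac{\partial E_p}{\partial y_j} \, \frac{\partial y_j}{\partial \lambda_j} = -\sum_k \delta_k \, w_{kj} \cdot y_j (1 - y_j) \, net_j $$

where $\delta_k = -\partial E_p / \partial net_k$ is the error term of the last layer neuron k.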
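
The gradient expressions of this appendix can be checked numerically. The sketch below is only an illustration, not the authors' implementation: it assumes the error measure E_p = 1/2 Σ_k (d_k - y_k)², which is consistent with ∂E_p/∂y_k = -(d_k - y_k) used above, and all function and variable names (`gradients`, `momentum_step`, the dictionary keys) are hypothetical.

```python
import math

def sigmoid(net, slope):
    # Logistic unit with an adjustable slope: y = 1 / (1 + exp(-slope * net)).
    return 1.0 / (1.0 + math.exp(-slope * net))

def forward(x, p):
    """p holds W_h, th_h, lam_h (hidden layer) and W_o, th_o, lam_o (last layer)."""
    net_h = [sum(w * xi for w, xi in zip(row, x)) + th
             for row, th in zip(p["W_h"], p["th_h"])]
    y_h = [sigmoid(n, lam) for n, lam in zip(net_h, p["lam_h"])]
    net_o = [sum(w * yj for w, yj in zip(row, y_h)) + th
             for row, th in zip(p["W_o"], p["th_o"])]
    y_o = [sigmoid(n, lam) for n, lam in zip(net_o, p["lam_o"])]
    return net_h, y_h, net_o, y_o

def pattern_error(x, d, p):
    # Assumed error measure: E_p = 1/2 * sum_k (d_k - y_k)^2.
    y_o = forward(x, p)[3]
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y_o))

def gradients(x, d, p):
    net_h, y_h, net_o, y_o = forward(x, p)
    # Last layer: common factor -(d_k - y_k) * y_k * (1 - y_k).
    common_o = [-(dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, y_o)]
    dE_dnet_o = [c * lam for c, lam in zip(common_o, p["lam_o"])]
    dE_dlam_o = [c * n for c, n in zip(common_o, net_o)]
    dE_dW_o = [[g * yj for yj in y_h] for g in dE_dnet_o]
    # Hidden layer: delta_k = -dE/dnet_k is propagated back through w_kj.
    delta = [-g for g in dE_dnet_o]
    back = [sum(delta[k] * p["W_o"][k][j] for k in range(len(delta)))
            for j in range(len(y_h))]
    common_h = [-b * yj * (1.0 - yj) for b, yj in zip(back, y_h)]
    dE_dnet_h = [c * lam for c, lam in zip(common_h, p["lam_h"])]
    dE_dlam_h = [c * n for c, n in zip(common_h, net_h)]
    dE_dW_h = [[g * xi for xi in x] for g in dE_dnet_h]
    return {"W_o": dE_dW_o, "lam_o": dE_dlam_o,
            "W_h": dE_dW_h, "lam_h": dE_dlam_h}

def momentum_step(grad, prev_delta, rate, momentum):
    # Delta(t+1) = -rate * dE/dv + momentum * Delta(t), for a weight or a slope.
    return -rate * grad + momentum * prev_delta
```

Comparing each entry returned by `gradients` with a central finite difference of `pattern_error` is a quick way to confirm the signs of the slope derivatives.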
The Constraint Based Decomposition Training Architecture

Sorin Draghici, Department of Computational Science
University of St Andrews, North Haugh, St Andrews, KY16 9SS, UK

Abstract. Constraint based decomposition (CBD) is a variation of the 'divide and conquer' method. The
CBD algorithm is composed of a weight updating rule (any algorithm able to train a single layer net), a pattern
presentation algorithm and a method for constructing the network. CBD finds an architecture able to solve the
problem and trains it at the same time. The search for the solution is performed by reducing the dimensionality of
the weight space and that of the training set. The training is performed on subnets with subgoals, and the weights
found in one subgoal training will be conserved and form a part of the final solution. The training is performed
exclusively on the simplest possible type of net (one layer, one neuron), and yet the resulting net is as powerful
as a multilayer perceptron. The pattern subsets always contain n-1 correctly classified patterns and one misclassified pattern.
The computation involved is very simple. No derivatives are calculated and no preprocessing is needed.

1. Introduction

Many training algorithms for feedforward multilayer networks have important drawbacks. Some of them are
inherently slow and can be trapped in local minima. For most of them, it is necessary to start the training with the
correct architecture. In the following, some factors influencing the training process are reviewed and a new algorithm
is proposed that addresses these factors. This algorithm is based on constraint satisfaction and constructs the network
during training.

Architectural issues. It is well known that the training difficulty increases with the complexity of the
architecture, in particular with the number of layers.

A certain architectural complexity is required because one layer networks cannot solve problems that are not
linearly separable. At the same time, multilayer networks are hard to train and their training may fail. If the network
has more than one hidden layer, other problems such as the attenuation of the error signal appear [Lang, 1988].

Deciding the architecture of a multilayer perceptron for a given I/O training set is a problem in itself.
Many algorithms work with a fixed architecture. Therefore, the correct architecture has to be chosen before training
starts. If training fails, it is not clear whether this is because of an insufficient architecture or another cause. If the
architecture is too rich, the solution weight state will have neurons which provide unnecessary information and/or
neurons which do not contribute to the solution [Sietsma, 1991].

Dimensionality of the weight space. The training problem can be posed as that of finding the
minimum of an error surface over a weight space. Independently of the algorithm used, this problem becomes more
difficult as the number of dimensions of the weight space increases. Wilensky and Neuhaus [Wilensky, 1990] report
that even for a simple, linearly separable problem like discrimination between two N dimensional gaussians, the
training time increases both with the number N of dimensions for the same architecture and with the number of
hidden units for the same dimensionality of the input space.

The pattern set. Fahlman and Lebiere in [Fahlman, 1990] identified the moving target problem as one
of the causes of the training problems for a multi-layer architecture. This effect is determined by the presence in the
training set of many different tasks to be accomplished. In this situation, more than one hidden unit will try to tackle
the same task. Only after one of the tasks is accomplished by one or more hidden neurons will other neurons be
redirected to other sources of error. This is one of the reasons for which the standard training is slow. This effect can
be eliminated if there is only one source of error in the training set and only one hidden unit to be trained. The
training problem is even simpler if the net does not have hidden units.

There are perhaps counter-examples for one or more of the statements above. These should be taken more
as assumptions justified by some experiments than as irrefutable truth. However, they are useful in
understanding the technique which is being proposed. As far as these factors are concerned, the ideal training problem
is to train an architecture with only one layer and only one neuron, and to have a unique source of error in the training
set. This is what the CBD technique addresses.

2. Time based decomposition (TBD) vs. constraint based decomposition (CBD).

A possible approach to solving a problem is "divide and conquer". Split the task into many simpler tasks
and solve each of them. There are two fundamentally different methods for splitting a complex goal into sub-goals:
history based decomposition and constraint based decomposition.

Let us consider for instance a robot (with a humanoid anatomy) situated in the middle of a room with the
task to open the door. The task is complex. The robot has to move towards the door. Perhaps at the same time, it
will move its arm, raising it from its normal position along the body towards the level of the door knob.
Concurrently, it will move its fingers, preparing them to grasp the door knob. During this complex movement, the
head and the eyes must move in such a way that the door knob is kept in the centre of the visual field independently
of the position of the body.

Let us suppose we ask a human to perform the task. We are going to record this solution, which is one of
the many possible solutions, and use it to teach our robot. The solution will be a sequence of intermediate positions,
a path P in the space S of all the possible positions. We call this path a solution path. Now, we could sample this
path by choosing a number of intermediate positions {p1, p2, ..., pn}. This is a discrete solution path. The first
sub-goal of the system is to reach the first point on the path, the second is to reach the next point and so on. Any
complex task for which we know (or could design) a path can now be learned. This type of decomposition will be
called time based decomposition.

There exists, however, another possibility to split the task into sub-tasks. For the robot to accomplish the
task, a set of constraints must be satisfied, e.g. the robot must be near the door (i.e. the distance between the robot's
mass centre and the door knob must be less than the arm's length), the hand must be at the height of the door knob,
the fingers must be open so that grasping the knob is possible, etc. The task is characterised by a set of constraints
(r1, r2, ..., rn). This set of constraints does not depend on the path in the position space the subject used to reach the
final state.

One could consider a constraint space with one dimension for each constraint in the constraint set. Fig. 1
shows a possible training path for a time based decomposition. The variables characterising the constraints all vary
at the same time. In each step, each of them will come closer to the value which characterises the solution.

Fig. 2 shows a training path for a constraint based decomposition. The subgoals are defined such that the
first one includes the first constraint, the second one the first two constraints and so on. The first step of the training
takes the net into the subspace corresponding to the first constraint. The search for the solution of the second
sub-goal will be performed in the subspace ss1, which is a subspace with n-1 dimensions of the n dimensional constraint
space. The search for the solution of the next subgoal will be performed in a subspace with n-2 dimensions and so
on.


3. Theoretical framework


Definition. A constraint is a condition necessary but not sufficient for the solution. It must be
possible for the solution to be expressed as a set of non-contradictory constraints.

Definition. A task is defined by a set of constraints (r1, r2, ..., rn). This set of constraints defines a point
p in the constraint space. The solution of a task is a point W in the weight space which satisfies the given set of
constraints.

Observation: In a constraint based decomposition the number of constraint variables varies, but if a
variable appears in a subgoal it will contain the final value of that variable (the value characterising the solution).
In a time based decomposition, the number of constraint variables remains constant and equal to the number of
constraints of the problem, but their values vary at each stage.

Definition. Given a task defined by the set of constraints (r1, r2, ..., rn), a time based
decomposition (TBD) is a discrete solution path P = (p1, p2, ..., pn) with the property that p1 is the initial point,
pn is the solution and each state pi satisfies a set of constraints (r1^i, r2^i, ..., rn^i).

Definition. Given a task defined by the set of constraints (r1, r2, ..., rn), a constraint based
decomposition (CBD) is a discrete solution path P = (p1, p2, ..., pn) with the property that p1 is the initial point,
pn is the solution and each state pi satisfies the constraints (r1, r2, ..., ri).
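
The difference between the two definitions can be made concrete with a toy task. The following sketch is an illustration only: the representation of constraints as Python predicates, the three-coordinate task and the name `is_cbd_path` are assumptions made here, not part of the paper.

```python
def is_cbd_path(path, constraints):
    """Check the CBD property: state p_i satisfies the constraints r_1 .. r_i."""
    return all(
        all(r(p) for r in constraints[:i])
        for i, p in enumerate(path, start=1)
    )

def is_solution(state, constraints):
    """A solution satisfies every constraint of the task."""
    return all(r(state) for r in constraints)

# Hypothetical task: reach the point (2, 4, 6); one constraint per coordinate.
target = (2, 4, 6)
constraints = [lambda p, k=k: p[k] == target[k] for k in range(3)]

# CBD-style path: each state fixes one more coordinate at its final value.
cbd_path = [(2, 0, 0), (2, 4, 0), (2, 4, 6)]

# TBD-style path: all coordinates move together towards the solution.
tbd_path = [(1, 2, 3), (2, 3, 5), (2, 4, 6)]

print(is_cbd_path(cbd_path, constraints), is_cbd_path(tbd_path, constraints))
```

Both paths end in a solution, but only the first one keeps every constraint satisfied once it has been reached, which is the property the CBD definition asks for.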

4.  Constraint  definition.  CBD  as  a  method  to  perform  a  dimensionality  reduction  in  the 
weight  space. 

For present implementation purposes, a constraint is defined as obtaining the correct output for patterns
situated in a limited region of the input space or, equivalently, the construction of the desired I/O surface above a
limited region of the input space. When the output is the correct one, the constraint is satisfied.

This definition satisfies the conditions for a constraint because:

a) The whole I/O surface can be cut into pieces corresponding to disjoint regions of input space {a1, a2,
..., an}. In order for the I/O surface S to be the goal I/O surface Sg, S must be equal to Sg in all of the regions a1, a2,
..., an. Therefore, S = Sg|ai (the condition that S be equal to Sg in the limited area ai) is a necessary but not sufficient
condition.

b) The solution, i.e. the goal I/O surface, can be expressed as a set of non-contradictory constraints: S is a
solution if and only if S = Sg|ai for any i from 1 to n, where n is the number of areas the input space has been cut
into. Due to the fact that the ai are disjoint by definition, the constraints cannot be contradictory.

The CBD training starts by training the first subgoal, which requires the satisfaction of the first constraint.
The second subgoal will ask for the satisfaction of the first two constraints. Therefore, the search for the solution of the
second subgoal is performed in a subspace with n-1 dimensions of the n dimensional constraint space.

The interesting case is when the reduction of dimensionality in the constraint space can be put into
correspondence with a reduction of dimensionality in the weight space. In this case, the weights found in one subgoal
training will be preserved unchanged and will be a part of the final solution. Having as few dimensions in the weight
space as possible was one of the characteristics of the ideal training situation.

The shape and the size of the regions of input space used in defining the constraints are very important. The
shape and the size chosen should depend on the problem. Ideally this should be done automatically, by the training
algorithm.

Search directed by subgoals

The search directed by subgoals characterises a situation in which there is only a weak coupling between the
constraint space and the weight space. A reduction of dimensions in the constraint space could, but does not
necessarily, correspond to a reduction of dimensions in the weight space.

The simplest form in which the CBD idea can be implemented is to define the subgoals by splitting the
training set into subsets. This is roughly equivalent to splitting the input space into disjoint regions and taking as the
training set of a subgoal the patterns in that region. A constraint is getting the correct output for a subset of the
training set. A subgoal is the training of an increasing number of constraints.
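
As an illustration of how subgoals grow from subsets, the sketch below builds the cumulative training sets from a list of disjoint pattern subsets. The function name `cumulative_subgoals` and the list-of-lists representation are assumptions made for this example.

```python
def cumulative_subgoals(subsets):
    """Return one training set per subgoal.

    Subgoal i contains the patterns of subsets 1..i, so each subgoal asks for
    all previously satisfied constraints plus one new constraint.
    """
    subgoals = []
    current = []
    for subset in subsets:
        current = current + list(subset)  # new list, earlier subgoals stay unchanged
        subgoals.append(current)
    return subgoals

# Hypothetical pattern identifiers grouped by input-space region.
regions = [["a1", "a2"], ["b1"], ["c1", "c2"]]
for number, training_set in enumerate(cumulative_subgoals(regions), start=1):
    print(number, training_set)
```

The last subgoal is always the whole training set, so the final training session is the standard one, started from a weight state prepared by the earlier subgoals.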

In order to check the effects of this CBD, one could simply train (with a standard weight change algorithm)
the subgoals corresponding to the chosen constraints. In constraint space, the net is asked to reach the first subgoal.
From this point, the net is trained with the second subgoal. No measures to ensure that the net will remain in the
subspace corresponding to the first subgoal are taken. The question is whether the net will be able to preserve the
information obtained by the training of the first subgoal in the training of the second one.

This will be shown by the evolution of the error for the patterns in the first subset during the training of
the second subgoal and so on. If this error remains small, it will mean that the search for the solution to the second
subgoal is directed by the subspace corresponding to the first one. If the error goes up, it will mean that the first few
weight changes in the second training session have thrown the net far from the subspace corresponding to the first
subgoal and the first training was useless.

This experiment could show the importance of the pattern presentation algorithm. If the result of the
training can be substantially changed by changing the pattern presentation algorithm, a training algorithm must be
seen as the combination of a weight changing algorithm and a pattern presentation algorithm rather than a weight
changing algorithm alone. A substantial change would be, for instance, the success of the CBD pattern presentation
algorithm in some problem where the batch pattern presentation algorithm fails. Both should use the same weight
updating rule.

The CBD pattern presentation algorithm tries to ensure that, for each training, the position of the initial
weight state in relation to the position of the goal is good. This is achieved by training exclusively on pattern sets
containing mostly patterns that the net is already able to respond to correctly.

Search restricted by subgoals. The CBD net.

In the case of a search restricted by subgoals, there is a strong connection between the constraint space and
the weight space. A reduction of dimensions in the constraint space is put into a direct correspondence with a
reduction of dimensions in the weight space. The weights found by training a subgoal will become a part of the final
solution.

Since the purpose is to train only a few weights at a time and to keep those weights unchanged afterwards,
the idea of constructing the net during the training comes naturally to mind. The CBD algorithm is formed by
a CBD pattern presentation algorithm, a construction mechanism for building the net and a weight change algorithm
for a single layer perceptron (delta rule for instance).

To illustrate the CBD algorithm, let us consider the example of a classification problem. Without loss of
generality, we shall consider only two classes C1 and C2 in an n dimensional input space. The problem is defined by
a set of patterns for each class. There are two output units O1 and O2, one for each class.

The CBD algorithm starts with the input units, one hidden unit and the bias unit (permanently set to 1).
For the classification problem one can use threshold units with (-1, +1) output range and 0 threshold. For problems
in which a precise analogue output is desired, one can use a different type of activation function with the same
algorithm.

Let C1 be the set of patterns in the class C1 and C2 the set of patterns in class C2.

The first stage is to construct a hidden layer (the hyperplane layer) which has a hidden unit for each
hyperplane necessary for the separation of the regions belonging to different classes. The result of this stage is a set
of hyperplanes h1, h2, ..., hk and a set of terms Ti of the form Ti = (sign(h1)h1 ... sign(hk)hk, Cj), where sign(hi) can
be 1, -1 or nil and j can be 1 or 2. This is equivalent to building a piecewise linear boundary between classes.

Each hyperplane divides the space into two regions, one positive and one negative. A hyperplane and its
sign form a factor. A factor is used to represent the corresponding half-space determined by the hyperplane. A term
is obtained by performing a logical and between factors. Not all the hyperplanes must contribute a factor to all
the terms. Finally, a logical or is performed between terms in order to obtain the expression of the solution for each
class.
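
The factor, term and class expressions described above can be evaluated directly. The sketch below is a hypothetical illustration: the data layout (a hyperplane as a weight tuple plus a bias, a term as a mapping from hyperplane index to sign, nil factors simply omitted) and the two hand-picked hyperplanes are assumptions made here.

```python
def side(hyperplane, x):
    """Return +1 or -1 depending on which half-space contains x."""
    weights, bias = hyperplane
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

def term_holds(term, hyperplanes, x):
    """Logical AND between factors; term maps hyperplane index -> required sign."""
    return all(side(hyperplanes[i], x) == sign for i, sign in term.items())

def classify(solution, hyperplanes, x):
    """Logical OR between the terms of each class; returns the first matching class."""
    for label, terms in solution.items():
        if any(term_holds(term, hyperplanes, x) for term in terms):
            return label
    return None

# Hand-made example for the XOR patterns: two parallel hyperplanes.
hyperplanes = [((1.0, 1.0), -0.5),   # h0: x + y - 0.5 = 0
               ((1.0, 1.0), -1.5)]   # h1: x + y - 1.5 = 0
solution = {
    "C1": [{0: 1, 1: -1}],           # between the two hyperplanes
    "C2": [{0: -1}, {1: 1}],         # below h0 or above h1
}

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, classify(solution, hyperplanes, point))
```

A hyperplane whose sign is nil in a term is simply absent from that term's mapping, which mirrors the statement that not every hyperplane contributes a factor to every term.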

The algorithm for the first stage (building and training the hyperplane layer) is presented in fig. 7.

The algorithm is presented as a recursive procedure. Its parameters are: a region of the space (initial value =
whole space), the training set divided into two sets, one for each class (initial value = the whole training set) and a
factor (initial value = nil). The factor describes the region and nil corresponds to the whole space.

The CBD algorithm starts by building a subgoal with only two patterns, one from each class. A unit
(which will become a hidden unit in the final net) will be added and trained such that it separates the two patterns. This
training problem is the simplest problem one can have: only one layer and only one unit. It is assumed that this
training will succeed. Let h be the hyperplane obtained by this training. This hyperplane will be saved. A new
pattern (from any class) will now be added to the current subgoal. The same unit will be trained again. The training
problem is again the simplest possible: one layer, one unit and the pattern set contains only one misclassified
example. If the training succeeds, the pattern will remain in the current subgoal and the new hyperplane will be used
subsequently. If the training fails, the old hyperplane will be restored and the pattern will be deleted from the current
subgoal. The process continues until all the patterns in the training set have been considered. In the simplest case,
the failure of a training is detected by imposing a timeout condition in number of epochs or by monitoring the weight
changes and stopping the training when the error evolution becomes asymptotic.

The hyperplane resulting at the end of this process will divide the space into two half-spaces h+ and h-. If h+
contains only patterns in the same class Cj, h+ will be added to the current factor and the result classified as class Cj.
The resulting region will be the intersection between the initial region and h+. Therefore, the factor characterising the
new region will be (factor and h+). If h+ is not homogeneous (it contains patterns in both classes) the algorithm
will be applied again to the h+ region. The same is done for h-.
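
The first stage can be sketched as a short recursive procedure. This is a minimal illustration under stated assumptions, not the procedure of fig. 7: the weight updating rule is taken to be the perceptron rule on a (-1, +1) threshold unit, failure is detected by a timeout in epochs, targets are +1 for C1 and -1 for C2, and all names (`train_unit`, `build_hyperplanes`, the `(x, target)` pattern format) are hypothetical.

```python
def side(w, x):
    # Threshold unit with outputs -1 / +1; the last weight is the bias weight.
    return 1 if w[-1] + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_unit(w, patterns, max_epochs=200, rate=0.1):
    """Perceptron rule on a single unit; returns (weights, success)."""
    w = list(w)
    for _ in range(max_epochs):
        errors = 0
        for x, target in patterns:
            if side(w, x) != target:
                errors += 1
                for i, xi in enumerate(x):
                    w[i] += rate * target * xi
                w[-1] += rate * target
        if errors == 0:
            return w, True
    return w, False  # timeout: the training is considered failed

def build_hyperplanes(patterns, factor=(), hyperplanes=None, terms=None):
    """patterns: list of (x, target), target +1 for class C1 and -1 for C2."""
    if hyperplanes is None:
        hyperplanes, terms = [], []
    first_pos = next(p for p in patterns if p[1] == 1)
    first_neg = next(p for p in patterns if p[1] == -1)
    subgoal = [first_pos, first_neg]
    w, ok = train_unit([0.0] * (len(first_pos[0]) + 1), subgoal)
    if not ok:
        raise RuntimeError("initial two-pattern training failed")
    for p in patterns:
        if p is first_pos or p is first_neg:
            continue
        candidate, ok = train_unit(w, subgoal + [p])
        if ok:  # keep the pattern and the new hyperplane
            w, subgoal = candidate, subgoal + [p]
        # otherwise the old hyperplane is kept and the pattern is left out
    index = len(hyperplanes)
    hyperplanes.append(w)
    for sign in (1, -1):
        region = [p for p in patterns if side(w, p[0]) == sign]
        new_factor = factor + ((index, sign),)
        classes = {target for _, target in region}
        if len(classes) == 1:      # homogeneous half-space: emit a term
            terms.append((new_factor, classes.pop()))
        elif len(classes) > 1:     # mixed half-space: recurse on it
            build_hyperplanes(region, new_factor, hyperplanes, terms)
    return hyperplanes, terms

xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
hyperplanes, terms = build_hyperplanes(xor)
print(len(hyperplanes), terms)
```

Because the saved hyperplane always separates the two patterns that opened the subgoal, each half-space holds strictly fewer patterns than the region it came from, so the recursion in this sketch terminates whenever the initial two-pattern training succeeds.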

The next stage is very simple and does not need training at all. CBD builds another layer with a unit for
each term Ti = sign(h1)h1 ... sign(hk)hk. Let us consider the unit associated with Ti. The bias weight, w_bias, will be
set at an arbitrary negative value (e.g. -0.5). The unit will be connected only with the units corresponding to those
hyperplanes in Ti. The values of the weights will all be equal to x, where x is the solution of the following
inequalities:

$$ x > \frac{threshold + bias\_weight}{fan\_in}, \qquad x < \frac{threshold + bias\_weight}{fan\_in - 2} $$

where fan_in is the number of hyperplanes present in Ti. The first inequality
ensures that the unit will be turned on if all of the units are in the state required by the sign of their corresponding
factors. The second inequality ensures that the unit will remain off if even a single unit has the wrong activation.
For the chosen type of neurons, the threshold is 0. In this formula, bias_weight represents the absolute value of
the bias weight. The sign of each weight will be the sign of the corresponding hyperplane in Ti. This unit
implements a logical and, and will be turned on if and only if the input pattern is in the region characterised by Ti.

There will be a unit for each term in the solution given by the algorithm. Finally, another layer of weights
will implement a logical or. This layer will contain a unit for each class (2 units in this case) and each unit will be
connected with the units corresponding to its class on the previous layer. The weight can have any value greater than
the threshold (any positive value for 0 threshold).
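
The choice of the common weight x for a term unit can be worked out from the two conditions stated in the text. The sketch below is an assumption-laden illustration: it supposes hyperplane units with outputs -1 or +1, a bias input of 1 with weight -bias_weight, and a unit that fires when its net input exceeds the threshold. Under these assumptions "on when all fan_in inputs agree" gives fan_in * x - bias_weight > threshold, and "off when one input disagrees" gives (fan_in - 2) * x - bias_weight < threshold. The function names are hypothetical.

```python
def and_unit_weight(fan_in, bias_weight=0.5, threshold=0.0):
    """Pick a weight magnitude x satisfying both conditions (derived above)."""
    lower = (threshold + bias_weight) / fan_in
    if fan_in <= 2:
        # With one or two inputs the 'one input wrong' case is already below
        # the threshold for any positive x, so only the lower bound matters.
        return 2.0 * lower
    upper = (threshold + bias_weight) / (fan_in - 2)
    return 0.5 * (lower + upper)  # any value strictly between the bounds works

def and_unit_output(signs, inputs, bias_weight=0.5, threshold=0.0):
    """signs: required sign (+1/-1) per hyperplane; inputs: actual unit outputs."""
    x = and_unit_weight(len(signs), bias_weight, threshold)
    net = sum(s * x * a for s, a in zip(signs, inputs)) - bias_weight
    return 1 if net > threshold else -1

# Term requiring h1 positive, h2 negative, h3 positive.
print(and_unit_output([1, -1, 1], [1, -1, 1]))   # all factors satisfied
print(and_unit_output([1, -1, 1], [1, 1, 1]))    # one factor violated
```

More than one violated factor lowers the net input further, so the unit stays off in those cases as well.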

In conclusion, the CBD algorithm builds a net with 3 layers of active weights. The first layer implements
hyperplanes which separate the patterns into regions containing only patterns in the same class. The second layer
implements a logical and between different hyperplanes. In the set theory language this layer implements an
intersection between half-spaces given by different hyperplanes. Each unit on this layer will be activated only by
input patterns situated in a homogeneous region of the input space and can be associated with their output class.
The third layer implements a logical or between units on the second layer. In other words, it performs the reunion
of different regions corresponding to the same class. The typical final architecture of the net is presented in fig. 3.

Another option is to use only two layers. The first one is built in the same way and the second one is
trained with the "delta" rule or any other well-known learning algorithm for single layer networks. Because the
hyperplanes are in the correct position, the problem is now separable in the hidden layer activation space and the
training can succeed.

The CBD algorithm can be easily extended to cope with more than two classes. The training time can be
further reduced by performing a simple on-line analysis of the current subgoal and the current solution, but these
enhancements are beyond the scope of the present paper.

Experiments.


1. Search directed by subgoals. Pattern presentation algorithm.

The experiments were done with a classification network with a 128-20-36 architecture. The net is used to
classify characters of the English alphabet (10 digits and 26 letters).

The training patterns were obtained from images of car number plates. The image is segmented into number
plate and background and the number plate is segmented into characters. Each character area is binarised and divided
into 8 by 16 = 128 rectangles and a mean luminance value is calculated for each rectangle. These 128 luminance
values are normalised and the result of this normalisation constitutes the input to the net. The output is a vector
in which only the element corresponding to the character presented is active.

Due to the various character sets used in the number plates, different illumination conditions and different
positions of the camera with respect to the car, the differences between various instances of the same character are
rather large in spite of the various normalisations performed. As a consequence, various instances of the same character
will be spread over a large volume in the input space. The training set contains 180 patterns.

Two types of experiments were performed. The first type compares the training of a constraint based
decomposition approach with the standard training approach. The second type of experiments investigates
in more detail the behaviour of the constraint based decomposition and shows that the search is indeed directed by the
subgoals.


A. Constraint based decomposition versus standard training.

The standard approach of training with the whole training set is compared with a constraint based decomposition
approach. The weight changing mechanism is the generalised delta rule [McClelland, 1986].

In each trial, two networks were initialised with the same initial weight state and used the same values of
the above parameters during the whole training process. One network used the classical technique of training with the
whole set of patterns and the other was trained with a constraint based decomposition of the training set.
Subsequently, the standard training was tried with different parameters (especially learning rate) but it was never
successful.

The standard approach training fails to converge in 15000 epochs, whereas the CBD training converges to an
error limit of 0.3 in approx. 13200 epochs and to an error limit of 0.2 in approx. 13600.

As the final performance of the net depends ultimately on the error limit for the last sub-goal only, the
speed of the training can be dramatically increased if a higher error limit is used to detect the end of a sub-goal
training. An error limit of approx. 0.75 for the subgoals reduces the total training time (in epochs) by approximately
a half. This intermediate error limit depends very much on the problem.

Note that an epoch for the whole training set necessitates the calculation of the weight changes determined
by the entire number of training patterns, whereas 1 epoch for a sub-goal training set necessitates only the calculation
for the number of the patterns in the sub-goal training set. Therefore, the CPU time needed for an epoch in the
standard technique will be much longer than the time needed for an epoch for any sub-goal but the last one, which is
the whole training set.

As discussed in the introduction, the results of the generalised delta rule as a weight updating algorithm can
be improved using various techniques. Their combination can also be used with a constraint based decomposition
pattern presentation algorithm. It is believed that the use of some of the techniques would not essentially affect the
overall result of the comparison.

A strict constraint based decomposition would ask for sub-goals formed by adding the characters one by one, i.e.
the first sub-goal is implemented by a training set formed with instances of the first character, the second with
instances of the first two characters, etc. This is inefficient because all the units in the output layer whose
class is not present in the current subgoal will tend to have 0 weights. In these conditions the distance in weight space
between the initial position and the solution would be very large for each subgoal. Without any other precautions,
the CBD pattern presentation algorithm would not be able to help the training. For this reason, the first subgoal was
built with a pattern from each class, ensuring that the first subgoal offers a fair start.

B. Investigating the search directed by subgoals

Fig. 4 shows the evolution of the error during a CBD training session. Note that the error goes up at the
beginning of the training of each subgoal, but the error does not accumulate from one subgoal to another.

In fig. 5 the evolution of the error during a subgoal training is plotted against the number of epochs. Both
the error over the current subgoal and the error over the previous subgoal are plotted. This graph shows that even
when the error over the current subgoal training set is large due to the presence of the newly added patterns, the error
over the previous sub-goal training set remains small, which shows that the search takes place in the sub-space of
the target space determined by the previous subgoal. Therefore, in this case, the subgoal manages to direct the search
for the solution.

These experiments show the role of the pattern presentation algorithm. Both the standard training and the
CBD training used the same weight change algorithm. However, the standard training fails systematically on this
problem whereas the CBD can be successful. The success of the pattern presentation algorithm itself nevertheless
depends too much on the subgoal definition, which must be done manually. Although in this case there was an
improvement, the CBD pattern presentation algorithm by itself does not guarantee success.

2. Search restricted by subgoals. CBD training algorithm (architecture and pattern presentation)

The full CBD algorithm has been tested with linearly inseparable problems containing the XOR training
set. An example is presented in fig. 6. The figure contains both the training set and the hyperplanes the algorithm
found in solving the problem.

The architecture obtained at the end of the training used 5 hyperplanes of the form $w_1 x + w_2 y + w_{bias} = 0$. The
solution is:

$$C_1 = h_0 + \bar{h}_0 h_1 h_2 + \dots$$

C2  = 

The horizontal bar means the sign of the corresponding hyperplane is minus, and the hyperplanes with sign
nil are missing from the expression of the solution.

The solution is interpreted in the following way: a pattern will be classified as C1 if it determines (a
positive activation of the neuron associated with h0) or (a negative activation of the neuron associated with h0) and
(a positive activation of the neuron associated with h1) and (a positive activation of the neuron associated with h2)
or...etc. Logical and has a higher priority than logical or. As previously described, the expressions for C1 and C2
can be seen as a union of regions obtained by intersecting half-spaces determined by different hyperplanes.
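
The interpretation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three hyperplanes below are made up (they are not the ones found for fig. 6), the names `sign_of`, `in_class`, `PLANES` and `C1_EXPRESSION` are hypothetical, and only the first two terms named in the text are encoded.

```python
def sign_of(plane, point):
    """Return +1 if the point gives a positive activation for the hyperplane
    w1*x + w2*y + wbias = 0, otherwise -1."""
    w1, w2, wbias = plane
    x, y = point
    return 1 if w1 * x + w2 * y + wbias > 0 else -1


def in_class(expression, planes, point):
    """Evaluate an OR of AND terms; each term lists (plane index, required sign)."""
    signs = [sign_of(plane, point) for plane in planes]
    return any(all(signs[i] == s for i, s in term) for term in expression)


# Hypothetical hyperplanes (w1, w2, wbias); NOT the ones found in fig. 6.
PLANES = [(1.0, 0.0, -1.0),   # h0: positive when x > 1
          (0.0, 1.0, 0.0),    # h1: positive when y > 0
          (1.0, 0.0, 1.0)]    # h2: positive when x > -1

# h0 OR (NOT h0 AND h1 AND h2): the first two terms described in the text.
C1_EXPRESSION = [[(0, +1)], [(0, -1), (1, +1), (2, +1)]]

print(in_class(C1_EXPRESSION, PLANES, (2.0, 0.0)))   # on the positive side of h0
print(in_class(C1_EXPRESSION, PLANES, (0.0, -1.0)))  # matches neither term
```

Because `all` is evaluated inside `any`, logical and binds more tightly than logical or, as the text requires.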

Discussion 

The characteristics of the CBD architecture and training algorithm are:

1. The CBD training algorithm is composed of a weight updating algorithm for a single layer (delta rule,
for instance), the CBD pattern presentation algorithm and the CBD construction method.

2. CBD has the abilities of a multilayer perceptron but the training is performed exclusively in subnets
with a minimal architecture containing only one layer and one neuron. This is the simplest possible training
problem from the point of view of the architecture (the best possible situation for the first two factors in section 1:
only one layer, so one can use a simple weight update rule, and only one neuron, so that the number of dimensions of the
weight space is minimal).

3. CBD trains exclusively on training sets with n examples of which n-1 are already correctly classified. This
is the simplest possible training problem from the point of view of the training set and this eliminates the herd
effect. Furthermore, all the training sets have fewer patterns than the original set and most of them have only very
few patterns. This is a reduction of the training set's dimensionality.

4. CBD automatically finds an architecture able to solve the problem. The algorithm guarantees the absence
of useless units (whose outputs are not actually used in performing the classification). Although the architecture
found by the net is often the minimal one, the algorithm does not offer guarantees in this sense. However, the
convergence is guaranteed.

5. The computation involved in training is very simple. No first or second order derivatives are used. No
preprocessing is needed. The training is very fast and the resulting network is able to solve linearly inseparable tasks.

6. The fact that the first hidden layer is not fully connected to the and layer avoids the interference between
hyperplanes, which is one of the difficulties faced by a fully connected net.

7. CBD can be used for incremental learning, in which a trained network is asked to adapt itself to new
patterns. The CBD net will train only the smallest possible region(s) of the input space which contain the new
pattern(s). The hyperplanes introduced to satisfy the new patterns will not affect the classification of other regions.

8. The CBD pattern presentation algorithm can be used with any training algorithm. Combined with the
standard backpropagation weight updating algorithm it gives an improvement over the standard training but the
results are not always guaranteed.
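
Points 1 and 3 can be illustrated with a small sketch of how a subgoal grows one pattern at a time, with a rollback when training fails. It is not the authors' implementation: the perceptron-style update, the epoch cap used as a failure test, the two-dimensional inputs and the names `output`, `train` and `grow_subgoal` are all assumptions made for the example.

```python
def output(weights, x):
    """Threshold unit: +1 on the positive side of the hyperplane, else -1."""
    net = sum(w * v for w, v in zip(weights, tuple(x) + (1.0,)))
    return 1 if net > 0 else -1


def train(weights, subgoal, max_epochs=100, rate=0.1):
    """Perceptron-style training with an epoch cap (both are assumptions).
    Returns (success, weights)."""
    w = list(weights)
    for _ in range(max_epochs):
        errors = 0
        for x, target in subgoal:
            if output(w, x) != target:
                errors += 1
                w = [wi + rate * target * v
                     for wi, v in zip(w, tuple(x) + (1.0,))]
        if errors == 0:
            return True, w
    return False, w


def grow_subgoal(seed, candidates):
    """Add candidate patterns one at a time, rolling back on failure."""
    _, h = train([0.0, 0.0, 0.0], seed)   # two inputs plus a bias weight
    subgoal, rejected = list(seed), []
    for p in candidates:
        h_copy = list(h)                  # save h
        subgoal.append(p)
        success, h = train(h, subgoal)
        if not success:
            h = h_copy                    # restore h
            subgoal.pop()                 # remove p from the subgoal
            rejected.append(p)
    return h, subgoal, rejected


# XOR-like set: the last pattern cannot share a hyperplane with the first three.
seed = [((0, 0), -1), ((0, 1), 1)]
candidates = [((1, 0), 1), ((1, 1), -1)]
h, subgoal, rejected = grow_subgoal(seed, candidates)
```

Each call to `train` sees a set in which all but the newest pattern are already classified correctly by the saved hyperplane, which is the situation described in point 3.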


Relation to other work.

The main differences between CBD and older training algorithms are the solutions to the training problems
brought by performing the training only in the simplest architecture and the simplest training set.

Training only one neuron at a time and gradually building the net are present in the Cascade Correlation
(CC) net proposed by Fahlman and Lebiere in [Fahlman, 1990]. However, the CC algorithm uses the whole pattern set
and the resulting architecture is different. CC builds feature detectors which could be useful in some problems.
However, further research must be done in order to compare the performances of CBD and CC.

The idea of positioning the hyperplanes in the right places is present in several other techniques such as
entropy nets and query learning. The entropy nets use a decision tree to classify the regions and two layers, one for
logical and and one for logical or. These layers are similar to those used by CBD. However, the building of the
decision tree can be a very lengthy process because it involves testing very many candidate questions for each node in
the tree. For instance, CART (Classification and Regression Trees) uses a standard set of candidate questions with
one candidate test value between each pair of data points. At each node, CART searches through all the variables,
finding the best split for each. Then the best of the best is found (see [Breiman, 1984]). This can be a very time
consuming process.

Query learning is more efficient but it requires the existence of an oracle able to give the correct
classification for any point in the input space. In the usual case of learning from examples, where only a limited
number of data points is available, this is not possible. The query learning algorithm is presented in [Baum, 1990]
and [Baum, 1991].

CBD builds up the I/O surface gradually, one region after another. The idea of locally constructing the I/O
shape is present in all RBF algorithms (see [Moody, 1989], [Musavi, 1992], [Poggio, 1990]). In the RBF case, one
unit with a localised activation function will ensure the desired response for a small region of the I/O space.
However, there are situations in which a hyperplane net is better than an RBF net. Furthermore, for an RBF net to
be efficient a preprocessing stage must be performed and parameters like radii of the activation functions, their shape
and orientation, the clustering, etc. must be calculated. For some problems the simplicity of CBD could be preferred.

The idea of building the solution by combining partial solutions was proposed by Hinton and Anderson in
[Hinton, 1981]. However, the combining method proposed there is a simple sum of the weight matrices and it works
only for orthogonal patterns. This can be seen as a particular case of CBD in which a constraint is one pattern. In
this special case, each subspace of the constraint space is characterised by a unique weight state (a partial solution).
The set of partial solutions can be combined to give a unique weight state which satisfies all the constraints and
therefore is the solution. In constraint space, the above technique is equivalent to finding the subspaces
corresponding to each pattern and directly calculating their intersection, which is the solution.

There are few algorithms which ensure the convergence of the training process. The upstart algorithm
[Frean, 1990] builds a hierarchical structure (which can be eventually reduced to a 2 layer net) by starting with a unit
and adding daughter units which cater for the misclassifications of the parents. Sirat and Nadal proposed a similar
algorithm in [Sirat, 1990]. However, both of them work for on/off units only. Mezard and Nadal proposed a tiling
algorithm which starts by training a unit on the whole training set. The training is stopped when the unit produces
the correct target on as many patterns as possible. This pseudo-solution weight state is given by the pocket algorithm,
which assumes that if the problem is not linearly separable the algorithm will spend most of its time in a region
giving the fewest errors. The pocket algorithm simply monitors the weight change and stops the training after some
chosen time t. It is very inefficient to start with the whole training set because most of the time the training will fail
and the pocket algorithm does not offer any guarantees regarding the optimality of the weight state obtained. A short
description of these techniques can be found in [Hertz, 1991].

Romaniuk and Hall [Romaniuk, 1993] proposed a divide and conquer net which builds up the network.
Their divide and conquer strategy starts with one neuron and the entire training set. If the problem is linearly
inseparable (usually), the first training is bound to fail and this is detected by a time-out condition. In comparison,
CBD starts with the minimum problem, which is guaranteed to have a solution. The divide and conquer technique
also requires a pre-processing stage in which the nearest neighbour is found for each pattern in the training set. The
architecture given by the divide and conquer algorithm is similar to that of a Cascade Correlation network, with each
unit connected to all the input units. However, the architecture of the DCN network depends on the initial weight
state, which can be inconvenient in some cases.

The extentron, proposed by Baffes and Zelle in [Baffes, 1992], builds up a network using the idea that a
problem which is not linearly separable in the original input space becomes so in a space with more dimensions. A
unit is used to separate at least one pattern and any other subsequent units will be connected to it as well as to all
input patterns. This cascade connection means that for highly non-linear problems such as 2-spirals, the last few
hidden units will have to solve a problem in a highly dimensional space (2 dimensions of the input space plus n
dimensions of the first n hidden units). Although this is perceptron training, it can be more difficult because of
the possibly large number of dimensions.

Fig. 1. A history based training (in restriction space). The network is trained with intermediate
targets. Each intermediate target (sub-goal) is characterised by the same number of restrictions as
the original training set.

Fig. 2. A restriction based decomposition (in target space). Each subgoal asks the satisfaction of
one restriction more than the previous sub-goal. The search for a solution of a sub-goal is
performed in a sub-space of the restriction space.

Fig. 3. The complete architecture of a CBD network. The picture shows the solution of the problem presented in fig.

Fig. 4. RBD training (error against epochs). Each peak corresponds to the start of a subgoal training. The sudden
increase in error is due to the new patterns.


Fig. 5. Sub-goal training. The evolution of the error during a subgoal training. The total error increases as the new
patterns are added. However, the error for the patterns in the previous subgoal remains fairly small even if it
becomes greater than the error limit, which is 3. This shows that the training path remains in the vicinity of the
subspace of the restriction space determined by the previous subgoal, i.e. the training is directed by the previous subgoal.

Fig. 6. An example of an I/O set. The architecture obtained at the end of the training uses 5 hyperplanes.


separate ( region, C1 = set of patterns in C1, C2 = set of patterns in C2, factor ) is

• Build a subgoal S with patterns xi^C1 and xj^C2 taken at random from C1 and C2. Delete xi^C1 and
  xj^C2 from C1 and C2.

• Add a hidden unit and train it to separate xi^C1 and xj^C2. Let h be the hyperplane which separates
  them.

• For each pattern p in C1 U C2:

    • Add p to the current subgoal S
    • Save h in h_copy
    • Train with the current subgoal S
      If not success then
        • Restore h from h_copy
        • Remove p from S

• Let new_factor = factor and (h,'+')

• If the positive half-space determined by new_factor contains only patterns in the same class Cj
  then
    • Classify new_factor as Cj
  else
    • Delete from C1 and C2 all the patterns which are not in h+. Store the result in new_C1
      and new_C2.
    • Separate( h+, new_C1, new_C2, new_factor )

• Let new_factor = factor and (h,'-')

• If the negative half-space determined by new_factor contains only patterns in the same class Cj
  then
    • Classify new_factor as Cj
  else
    • Delete from C1 and C2 all the patterns which are not in h-. Store the result in new_C1 and
      new_C2.
    • Separate( h-, new_C1, new_C2, new_factor )

Fig. 7. The CBD algorithm for building and training the hyperplane layer.

Abstract

The EBP (error back-propagation) is now the most used learning algorithm for FFANNs (feedforward
artificial neural networks). There are two versions of the EBP algorithm: on-line and batch. The on-line
version updates weights after presentation of every training pattern. On the other hand, the batch version
of the EBP algorithm accumulates weight-corrections for all the training patterns, and then updates all
weights.  Each  version  has  its  advantages  and  disadvantages  when  compared  on  the  basis  of  learning  time 
and  convergence  rate.  In  this  paper,  we  propose  a  method  for  integrating  on-line  and  batch  versions  of  the 
EBP  algorithm  into  one  algorithm.  We  also  propose  and  study  three  approximate  implementations  of  the 
integrated  EBP  algorithm.  Our  simulation  study  of  the  integrated  EBP  algorithm  for  the  XOR  problem 
has  shown  smaller  learning  time  and  better  convergence  rate  than  those  of  the  Quickprop  algorithm 
proposed  by  Fahlman.  All  three  approximate  implementations  of  the  integrated  EBP  algorithm  for  the 
binary  version  of  the  Majority-XOR  problem  have  shown  very  favorable  performances  in  comparison  with 
extensive  study  data  available  in  the  literature. 


1  Introduction 

Artificial  Neural  Networks  (ANNs)  are  mathematical  models  developed  to  mimic  certain  information  storing 
and  processing  capabilities  of  the  brain  of  higher  animals.  Although  the  interest  of  the  research  community 
in  ANNs  as  a  means  for  intelligent  computing  had  existed  for  over  30  years  (see  [WL90]),  there  is  little  doubt 
that Rumelhart, McClelland, and the PDP research group are credited with revitalizing wide interest in
it [RM+86]. The different models and their applications can be found in many books and in such surveys
as [Hin89, Lip87, WL90]. This paper concentrates only on Feedforward ANNs (FFANNs) and Error Back
Propagation (EBP) learning algorithms for them. Basic elements of the theory, as pointed out by le Cun
[lC88], can be traced back to the book of Bryson and Ho [BH69]. It was more explicitly stated by Werbos
[Wer74], Parker [Par85], le Cun [lC86], and Rumelhart-Hinton-Williams [RHW86]. The EBP is now the
most  popular  learning  algorithm  for  multilayer  FFANNs,  because  of  its  simplicity,  because  of  its  power  to 
extract  useful  information  from  examples,  and  because  of  its  capability  of  storing  information  implicitly  in 
the  connecting  links  in  the  form  of  weights. 

Despite its power, the original version of the EBP learning algorithm has been of great concern to practical
users for many reasons: i) it is extremely slow if it does converge, ii) it may get stuck in local minima before
learning all the examples, iii) it is sensitive to initial conditions, and iv) it may start oscillating, etc. Several
methods have been proposed to improve the performance of the EBP algorithm. Important methods to speed up
the EBP algorithm have been surveyed in [Sar92], and are very briefly discussed next.

Efforts to Speed Up the EBP Algorithm. Rumelhart, Hinton, and Williams [RHW86] have insightfully
argued  that  a  relatively  smaller  learning  rate  coefficient  makes  learning  slow,  but  too  large  a  value  causes 
oscillation  preventing  the  network  from  learning  the  task,  and  have  suggested  adding  a  momentum  term  in 
the  weight  updating  rule  which  dynamically  increases  or  decreases  the  effective  value  of  the  learning  rate 
coefficients  depending  on  the  nature  of  the  energy  surface.  Further  analysis  and  experimental  study  on  the 
effect  of  momentum  coefficient  on  improvement  of  learning  can  be  found  in  [Bat92,  E092,  Jac88,  Tol90, 
Wat88]. Although a limited effective dynamic range of learning rate coefficient is obtained by adding a
momentum term, in most practical cases it is not good enough to cover requirements of all types of energy
surfaces that may have a wide range of gradient values, and hence various methods have been reported to
directly adapt learning rate coefficient [Bat89, D088, Jac88, PS91, Tol90, Wei91]. Some of these methods keep
one learning coefficient for each weight [D088, Jac88, Tol90], and others keep only one learning rate coefficient
for all weights [Bat89, PS91, Wei91]. Conjugate gradient method or some approximations to it, when used
with the EBP algorithm, have shown considerable improvement in learning speed [Bat89, Bat92, Kin92]. Yet
another  method  for  fast  learning  with  the  EBP  algorithm  is  to  use  new  (other  than  standard  sum-of-the- 
squared  error)  energy  functions  [E092,  AS92].  The  size  of  the  learning  set  affects  the  learning  rate  and  could 
be  considered  in  selecting  learning  rate  coefficient  [vON92].  Other  suggested  methods  for  improving  learning 
rate  include  rescaling  of  error  at  every  layer  [RIV91],  and  using  expected  outputs  instead  of  actual  outputs 
to  compute  weight  correction  [Sam91].  In  the  next  section,  following  [RHW86]  the  original  version  of  EBP 
algorithm  is  presented. 


2  Original  EBP  Algorithm 


Error back-propagation (EBP) learning rule (popularly known as back-propagation algorithm), which is also
known as the Generalized Delta Rule (GDR), was proposed by Rumelhart, Hinton, and Williams in their
seminal work [RHW86]. In the EBP learning algorithm, following the presentation of an input vector $X_m$
and a target vector $T_m$ (for $m = 1 \dots P$), the rule for updating weight $w_{jk}^{l}$ of the link connecting the $k$th node in
a layer $l$ to the $j$th node in the subsequent layer $l + 1$ is given by:

$$\Delta w_{jk}^{l} = -\eta \times \frac{\partial E}{\partial w_{jk}^{l}} \qquad (1)$$

where $\eta$ is a constant known as the learning rate coefficient, $\frac{\partial}{\partial w_{jk}^{l}}$ is the partial derivative with respect
to $w_{jk}^{l}$, and $E$ is the energy function. Rumelhart, Hinton, and Williams [RHW86] in their original EBP
algorithm used the sum-of-the-squared error as the energy function:

$$E_{ta} = \frac{1}{2} \sum_{m=1}^{P} \sum_{n=1}^{n_o} (t_{mn} - y_{mn})^2 \qquad (2)$$

where $n_o$ is the number of units in the output layer, and $t_{mn}$ and $y_{mn}$ are the target output and actual output,
respectively. The energy function $E_p$ can be defined for only one training pattern pair $(x_p, d_p)$ as follows:

$$E_p = \frac{1}{2} \sum_{n=1}^{n_o} (d_{pn} - y_{pn})^2 \qquad (3)$$


There  are  two  versions  of  the  EBP  algorithm,  on-line  and  batch.  In  the  on-line  EBP  algorithm,  the  weights 
are  updated  using  the  error  corresponding  to  every  training  pattern.  This  method  uses  energy  function 
defined  by  equation  3.  However,  in  the  batch  EBP  algorithm,  the  weights  are  updated  after  accumulating 
errors  corresponding  to  all  input  patterns,  and  thus  makes  use  of  the  energy  function  defined  by  equation  2. 
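
The difference between the two versions can be sketched for a single linear unit. This is a minimal illustration, assuming a unit $y = w \cdot x$ with the squared-error energy; the function names and the three training patterns are made up for the example and do not come from the paper.

```python
def grad(w, x, t):
    """Gradient of E_p = 0.5 * (t - w.x)**2 with respect to w."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [-(t - y) * xi for xi in x]


def online_epoch(w, data, eta):
    """On-line version: update the weights after every pattern."""
    for x, t in data:
        g = grad(w, x, t)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


def batch_epoch(w, data, eta):
    """Batch version: accumulate over all patterns, then update once."""
    total = [0.0] * len(w)
    for x, t in data:
        g = grad(w, x, t)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [wi - eta * gi for wi, gi in zip(w, total)]


# Made-up patterns generated by t = 2*x1 - x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w_online, w_batch = [0.0, 0.0], [0.0, 0.0]
for _ in range(300):
    w_online = online_epoch(w_online, data, eta=0.1)
    w_batch = batch_epoch(w_batch, data, eta=0.1)
```

On this small consistent problem both versions approach the same weights; the paths they take differ because the on-line version changes the weights between patterns.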

Values  of  several  parameters  are  of  importance  for  implementation.  The  initial  value  of  weights  should  be 
small and randomly chosen [RHW86] to avoid the symmetry problem. The $\eta$ value plays a very important
role. A smaller value of $\eta$ makes learning slow, but too large a value will cause oscillation preventing the
network from learning the task [RHW86]. In practice, the most effective value of $\eta$ depends on the problem. For
example, Fahlman [Fah88] has reported 0.9 as value of learning rate coefficient for one problem, while Hinton
[Hin87] has reported 0.002 as the value of learning rate coefficient for another problem. This variation in the
value  of  learning  rate  coefficient  for  faster  training  of  FFANNs  has  drawn  the  attention  of  many  researchers. 


3  Integration  of  On-Line  and  Batch  EBP  Algorithms 

In this section, we propose a method for integrating on-line and batch versions of the EBP algorithm. The
integrated algorithm attempts to take the best features of both on-line and batch EBP algorithms. We believe it
would increase convergence rate and reduce learning time. Since an exact implementation of this novel version
of the EBP algorithm requires considerably higher computation time, several methods for its approximate
implementation are also proposed. Before proceeding further some notations are introduced to make the
presentation concise and precise. Since the same method will apply to all the weights, without causing any
confusion we drop the subscripts and superscripts from $\Delta w_{jk}^{l}$.

The on-line version of the EBP algorithm updates weights after presentation of each pattern, and hence
if two patterns have weight corrections opposite in sign, they tend to cancel each other's effort. The batch
version of the EBP algorithm avoids this problem by accumulating weight-corrections for all patterns and
then making only one update at the end. This, however, risks the situation where the sum of all weight
corrections is zero but the network has not learnt many examples. In this situation the on-line algorithm
may be of use, since a weight correction for one pattern at a time reduces the error for that pattern.

The proposed algorithm defines its energy function by combining the energy functions of the on-line and batch EBP
algorithms. A new energy function $E_I$ for the novel EBP is as follows:

$$E_I = \beta \times E_p + \frac{1}{P} E_{ta} \qquad (4)$$

where $\beta$ is a positive constant. For a given pattern $p$, the weight updating rule is given by:

$$\Delta w = -\eta \times \left( \beta \times \Delta_w E_p + \frac{1}{P} \Delta_w E_{ta} \right) \qquad (5)$$

Thus, for updating weights for an input pattern, it is necessary to compute the weight correction for all the
patterns. This integrated energy function in computing weight corrections considers the effect of all the
patterns. Thus, for example, if the overall weight correction for a weight w due to all the patterns is negative,
but the weight correction due to a pattern p is positive, then the net weight correction will be lower, and vice
versa. This gives the algorithm some sense of fairness. The amount of learning for a pattern is reduced
if it has an adverse effect on the whole set of patterns. On the other hand, the amount of weight correction
for a given pattern is increased if it has a favorable effect on the whole set of patterns. Yet another
way to see the effect of the new learning algorithm is that it increases the dynamic range of the learning-rate
coefficient. When the values of $\Delta_w E_p$ and $\Delta_w E_{ta}$ are opposite in sign, the effective learning rate is lower;
but when $\Delta_w E_p$ and $\Delta_w E_{ta}$ are the same in sign, the effective learning rate is higher. Next, three methods for
approximate implementation of the proposed algorithm are presented.
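
A minimal sketch of the exact integrated rule may help to see why it is costly: the batch term is recomputed for every pattern. The sketch assumes a single scalar weight in a linear unit $y = w x$, reads $\Delta_w E$ as the derivative of $E$ with respect to $w$, and uses made-up data; the names `grad_p` and `integrated_epoch` are hypothetical.

```python
def grad_p(w, x, t):
    """dE_p/dw for one linear unit y = w*x with E_p = 0.5 * (t - y)**2."""
    return -(t - w * x) * x


def integrated_epoch(w, data, eta, beta):
    """One pass over the patterns using the exact integrated rule."""
    P = len(data)
    for x, t in data:
        g_p = grad_p(w, x, t)                               # on-line term
        g_ta = sum(grad_p(w, xm, tm) for xm, tm in data)    # batch term, recomputed
        w = w - eta * (beta * g_p + g_ta / P)
    return w


data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]   # made-up patterns with t = 2*x
w = 0.0
for _ in range(50):
    w = integrated_epoch(w, data, eta=0.1, beta=0.83)
```

Each weight update costs one gradient evaluation per pattern in the training set, so a full cycle costs on the order of $P^2$ evaluations, which is what motivates the approximate implementations.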

4  Three  Cost-Effective  Approximate  Implementations 

Average of the Last Cycle (ALC). In this method the weight corrections for each weight are accumulated for
all the patterns to be taught during the last cycle. This accumulated correction is divided by
the number of patterns to obtain an approximate value of $\frac{1}{P} \Delta_w E_{ta}$. A Pascal-like description of this is given
next.

    average-del-weight {approximation of (1/P) Δ_w E_ta for the next cycle} := 0;
    while not trained do
    begin
        cum-del-weight {for a weight w} := 0;
        for i := 1 to P {number of patterns} do
        begin cum-del-weight := cum-del-weight + Δw; end
        average-del-weight {approximation of (1/P) Δ_w E_ta for the next cycle} := cum-del-weight/P;
    end


Our  experience  is  that  this  very  simple  to  implement  method  works  well  when  the  number  of  patterns  to  be 
taught  is  not  ‘too  large’. 
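
The bookkeeping of ALC for one weight over one cycle can be sketched as follows. The sketch assumes the per-pattern corrections are already given as a list (in a real run each correction would itself depend on the approximation in use), and the name `alc_cycle` and the sample values are illustrative only.

```python
def alc_cycle(deltas, average_del_weight):
    """One training cycle of ALC for a single weight.

    deltas: the per-pattern corrections seen during this cycle.
    average_del_weight: the average taken over the previous cycle.
    Returns the approximation used for each pattern and the new average."""
    used = [average_del_weight for _ in deltas]   # constant throughout the cycle
    cum_del_weight = sum(deltas)
    return used, cum_del_weight / len(deltas)


used, avg = alc_cycle([0.5, -0.25, 0.5, 0.25], 0.0)
```

The approximation stays fixed during a cycle and is refreshed only at its end, which is why the extra cost per pattern is negligible.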

Modified Average of the Last Cycle (MALC). In this method, at the beginning of each cycle $\frac{1}{P} \Delta_w E_{ta}$
is the average value of weight corrections for all the patterns during the last cycle. For subsequent patterns
$\frac{1}{P} \Delta_w E_{ta}$ is approximated by subtracting $\frac{1}{P}$th of the current value and then adding $\frac{1}{P}$th of the $\Delta w$ for the
last pattern. A Pascal-like description of this procedure is as follows:

    approx-average-del-weight {approximation of (1/P) Δ_w E_ta at the beginning of the next cycle} := 0;
    while not trained do
    begin
        cum-del-weight {for a weight w} := 0;
        for i := 1 to P {number of patterns} do
        begin
            cum-del-weight := cum-del-weight + Δw;
            approx-average-del-weight {approximation of (1/P) Δ_w E_ta for the next pattern}
                := (1 - 1/P) * approx-average-del-weight + (Δw)/P;
        end
        approx-average-del-weight {approximation of (1/P) Δ_w E_ta at the beginning of the next cycle}
            := cum-del-weight/P;
    end


This method requires additional computation for updating the approximated value of $\frac{1}{P} \Delta_w E_{ta}$ after
presentation of each pattern. However, when the number of patterns to be taught is large, this approximation method
might reduce learning time to a great extent. Thus, the additional computation is justified when teaching a
large number of patterns to a network.
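
The running update of MALC can be sketched in the same way for one weight and one cycle. As before, the corrections are assumed to be given as a list, and the name `malc_cycle` and the sample values are illustrative only.

```python
def malc_cycle(deltas, approx):
    """One training cycle of MALC for a single weight.

    approx: the approximation at the beginning of the cycle.
    Returns the approximation after each pattern and the value
    used at the beginning of the next cycle."""
    P = len(deltas)
    cum_del_weight = 0.0
    trace = []
    for dw in deltas:
        cum_del_weight += dw
        # Drop 1/P of the current value, add 1/P of the latest correction.
        approx = (1 - 1 / P) * approx + dw / P
        trace.append(approx)
    return trace, cum_del_weight / P


trace, next_start = malc_cycle([0.5, -0.25, 0.5, 0.25], 0.0)
```

The update is an exponentially weighted running average with weight 1/P on the newest correction, so recent patterns influence the approximation within the same cycle.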


Weighted Average of Last and Current Cycle (WALCC). In this method the approximation of
$\frac{1}{P} \Delta_w E_{ta}$ after presentation of $i$ patterns is given by $(P - i)/P$ times the average weight correction for the
last cycle plus $i/P$ times the average weight correction for the $i$ patterns in the current cycle. A Pascal-like
description of this method is shown next.

    average-del-weight {average weight correction over the last cycle} := 0;
    approx-average-del-weight {approximation of (1/P) Δ_w E_ta for the next pattern} := 0;
    while not trained do
    begin
        cum-del-weight {for a weight w} := 0;
        for i := 1 to P {number of patterns} do
        begin
            cum-del-weight := cum-del-weight + Δw;
            approx-average-del-weight {approximation of (1/P) Δ_w E_ta for the next pattern}
                := (1 - i/P) * average-del-weight + (cum-del-weight/i) * (i/P);
        end
        average-del-weight {approximation of (1/P) Δ_w E_ta at the beginning of the next cycle}
            := cum-del-weight/P;
    end
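
A sketch of the WALCC weighting for one weight and one cycle follows. It again assumes the corrections are given as a list; the name `walcc_cycle`, the previous-cycle average of 0.5 and the sample corrections are made up for the example.

```python
def walcc_cycle(deltas, average_del_weight):
    """One training cycle of WALCC for a single weight.

    average_del_weight: average correction over the previous cycle.
    Returns the approximation after each pattern and the new average."""
    P = len(deltas)
    cum_del_weight = 0.0
    trace = []
    for i, dw in enumerate(deltas, start=1):
        cum_del_weight += dw
        # Blend last cycle's average with the average of the first i patterns.
        approx = (1 - i / P) * average_del_weight + (cum_del_weight / i) * (i / P)
        trace.append(approx)
    return trace, cum_del_weight / P


trace, new_average = walcc_cycle([0.5, -0.25, 0.5, 0.25], 0.5)
```

After the last pattern of a cycle the weight on the previous cycle is zero, so the approximation coincides with the new average.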


In the next section simulation results for approximate implementations of the integrated EBP algorithm are
reported.

5  Simulation  Results  and  Discussion 

The  performance  of  the  algorithms  presented  in  the  earlier  section  was  studied  through  simulation.  We 
compare the performance of our algorithms with the data in [Fah88, CT91]. Since the Quickprop algorithm
in  [Fah88]  has  shown  dramatic  improvements  and  benchmark  data  are  also  available,  we  first  compared  our 
results  with  it. 

The data in Table 1 for the Standard EBP and the ALC implementation of the Integrated EBP algorithm are for 10
different initializations of the network. In three instances the Standard EBP learning algorithm failed to stop.
We excluded them from our average computation. The data for the Quickprop algorithm are for 100 trials [Fah88].
As can be seen from Table 1, the Integrated algorithm outperformed both the standard EBP with momentum
and the Quickprop algorithms in learning speed and convergence rate.

algorithm           learning rate   momentum or β   restart   maximum epochs   minimum epochs   average epochs
Standard EBP        5.5             1.2             3         214              82               140.9
Quickprop [Fah88]   4.0             1.0             14        66               10               24.22
Integrated EBP      1.2             β = 0.83        0         34               12               19.4

Table 1 Comparison of Performance of Integrated EBP Algorithm with
That of Quickprop and Standard EBP Algorithms


In [CT91] it was reported that for the Majority-XOR problem (see description next) the standard EBP
algorithm with momentum converged only 86.84% of the time with a cutoff of 50,000 epochs. Thus, we
believe that it is a good problem on which to compare our algorithms' convergence with their results. Next we
briefly describe Majority-XOR and then the simulation results.

Majority-XOR (M-XOR) This is one of the problems Cohn and Tesauro used to ask 'can neural networks
do better than Vapnik-Chervonenkis bounds?' [CT91]. It is an extension of the linearly separable majority
function. Majority is a Boolean predicate in which the output is '1' if and only if more than half of the bits
are '1'. Majority-XOR is a Boolean function of N bits where the output of the function is '1' if and only if the Nth
bit disagrees with the majority of the first N - 1 bits. Following [CT91], in our study the input was 26-bit
binary patterns and the output was a one-bit binary value. The network had three hidden units and was presented 600
patterns until it learnt all of them or 400 epochs expired. Table 2 summarizes empirical observations from simulations.
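The Majority-XOR definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `majority`, `majority_xor` and `random_patterns` are hypothetical, patterns are assumed to be lists of 0/1 integers, and uniform random sampling of the 600 patterns is an assumption (the text does not say how they were drawn).

```python
import random

def majority(bits):
    """Return 1 if more than half of the bits are 1, else 0."""
    return int(2 * sum(bits) > len(bits))

def majority_xor(bits):
    """Return 1 iff the last bit disagrees with the majority of the other bits."""
    return majority(bits[:-1]) ^ bits[-1]

def random_patterns(n_patterns=600, n_bits=26, seed=0):
    """Draw labelled patterns uniformly at random (sampling scheme is an assumption)."""
    rng = random.Random(seed)
    patterns = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_patterns)]
    return [(p, majority_xor(p)) for p in patterns]
```

With N = 26 the majority is taken over 25 bits, so ties cannot occur.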


implementation   learning   φ      restart   maximum   minimum   average
                 rate                        epochs    epochs    epochs
ALC              0.125      2.4    0         393       145       207.45
MALC             0.125      2.4    1         259       134       201.11
WALCC            0.125      2.4    1         324       136       206.58

Table 2 Comparison of Performance of Three Approximate
Implementations of the Integrated EBP Algorithm


With 20 different initializations each of the three implementations was tested. As can be seen from Table 2,
the MALC and WALCC implementations each failed to stop within 400 epochs only once in 20 trials. If we consider
all 60 trials of the three implementations, it turns out that the integrated EBP algorithm converged 96.7% of the
time (58 of 60 trials) even with a cutoff of 400 epochs. This is a significant improvement over the study reported in
[CT91], where the Standard EBP algorithm with momentum terms converged only 86.84% of the time
with a cutoff of 50,000 epochs.

Abstract

This paper presents a modification to the backpropagation algorithm which
improves network development time in two areas. The modification reduces the
number of times the network convergence settles into a local minimum during
training. It simultaneously speeds convergence. Experimental results are given
which demonstrate the improvement gained in both of these areas: speed of
convergence is 3 to 5 times faster than with the standard BPN algorithm and, for
many of the experimental cases, local minima are almost totally eliminated.

1.  Introduction 

The backpropagation neural network has been used for both classification problems and
for generalization problems. This paper investigates the problem of local minima in the context
of generalization problems, although the results obtained should be equally applicable to
classification problems. (For a background on the use of Neural Networks in generalization
problems, see [Shekhar].)

A neural network "learns" a function from a set of input/output pairs representing the
function. As the learning algorithm progresses, it is possible that the network learns a less than
optimal function as its best approximation to the function over the full domain for which the
network is developed. At a local minimum, the algorithm makes no further progress on
approximating the function since any small change to the weights of the network will increase
the error found by applying the training patterns to the network. There has been considerable
discussion on the problem of local minima and methods to remove them, e.g. Paul Werbos's
presentation at WCNN '93 [Werbos]. Some methods have proved to be quite effective in
removing local minima but are too costly in terms of convergence speed. The Boltzmann machine
is one such attempt to avoid local minima but often has unacceptably long convergence time.

For a discussion of Boltzmann machines, see [Ackley].

Others have examined how the number of training patterns or the number of hidden
layers in a network affect local minima. In some cases dealing with function classification
networks, increasing the number of hidden layers increases the number of local minima
[Perugini]. In the generalization networks examined in this paper, this does not seem to be the
case. However, the complexity of the function to be generalized is a factor.

The convergence time which I investigate is simply a count of the number of training
cycles needed before the network's error on the training set is at some minimally acceptable
level. Each cycle consists of the application of all the input patterns and of the weight
adjustments after each pattern.

2. Algorithm Modification

The backpropagation algorithm is undoubtedly the most commonly used neural network
implementation today. The description of the algorithm can be found in many books and articles
[Rumelhart, or in Hecht-Nielsen].

The architecture for this paper consisted of a three-layer network. The transfer function
for the nodes of the hidden layer was the sigmoid transfer function, $1/(1 + e^{-x})$, where x is the
weighted sum of the inputs to the nodes plus the bias term. The output nodes simply yielded an
affine combination of the hidden nodes including the output node's bias.

The usual BPN algorithm is a gradient descent on the error surface, which is a manifold
over the "weight space" of the network. Any given weight configuration fixes the average error
of the network over all testing or training patterns. The backpropagation algorithm describes a
method to determine at each node the partial derivative of the error value with respect to each
weight at that node. A small modification of the weight is then made in the direction that will
minimize the error. Local minima on this weight surface are one of the banes of the technique.

The algorithm described in this paper avoids local minima by changing the weight
surface itself. First, define the transfer function for a hidden node to be:

$$ s(x, \mathrm{node\_max}, \mathrm{node\_min}) = \frac{\mathrm{node\_max} - \mathrm{node\_min}}{1 + e^{-x}} + \mathrm{node\_min} $$

Note that a change in node_max or node_min does not simply change the position on the error
surface for a given input or set of inputs; it changes the shape of the error surface. Changes in
these parameters have a markedly different type of effect on the output of a node for a given
input vector than those changes which occur when either an input weight is changed or the bias
value is changed.

The actual change to the backpropagation algorithm is given as follows:

• Define the transfer function for each sigmoid as in the equation above where each node
keeps node_max and node_min in its local memory. At initialization, these are set to 1
and 0 respectively.

• On learning, if sum is the accumulated, weighted, backpropagated errors to a node, then
node_max is updated by the formula:

node_max += node_max_delta = η * sum * logistic + α * node_max_delta

where η is the learning rate, α is the momentum rate and
logistic = $1/(1 + e^{-x})$ with x being the usual weighted input to the node.

• node_min is similarly updated by the formula:

node_min += node_min_delta = η * sum * (1 - logistic) + α * node_min_delta.
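The range updates described above can be sketched as a small Python class. This is a hedged illustration rather than the author's implementation: the class name `RangeSigmoidNode` and the method names are hypothetical, and only the node_max and node_min updates are shown (the ordinary weight and bias updates of backpropagation are omitted).

```python
import math

class RangeSigmoidNode:
    """Hidden node whose sigmoid output range [node_min, node_max] is trainable."""

    def __init__(self):
        self.node_max = 1.0        # initial upper asymptote, as in the text
        self.node_min = 0.0        # initial lower asymptote, as in the text
        self.node_max_delta = 0.0  # previous step, reused by the momentum term
        self.node_min_delta = 0.0

    def forward(self, x):
        """s(x, node_max, node_min) for weighted input x."""
        logistic = 1.0 / (1.0 + math.exp(-x))
        return (self.node_max - self.node_min) * logistic + self.node_min

    def update_range(self, x, err_sum, eta, alpha):
        """Apply the node_max / node_min updates for backpropagated error err_sum."""
        logistic = 1.0 / (1.0 + math.exp(-x))
        self.node_max_delta = eta * err_sum * logistic + alpha * self.node_max_delta
        self.node_max += self.node_max_delta
        self.node_min_delta = eta * err_sum * (1.0 - logistic) + alpha * self.node_min_delta
        self.node_min += self.node_min_delta
```

The factors `logistic` and `1 - logistic` are the partial derivatives of s with respect to node_max and node_min, which is why they appear in the two updates.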


3.  Test  Cases 

Currently, the efficacy of the algorithm has been verified on a number of generalization
problems for mappings from $R \to R$. Runs were made on linear, quadratic and cubic
polynomial equations and also on a Bessel function. The Mathematica plots at the top of the
next page show the functions that were used for training the networks for this paper.

I used 15 * BesselJ[0, x] so the range of the two functions is similar, thus facilitating comparison
of results.

For each function, the training set consists of the functional values for the integers in the
domain of the functions. Thus there were 7 training points for the cubic function and 11 training
points for the Bessel function.

4.  Result:  Effect  on  Local  Minima 

Look at the cubic function first. The following table summarizes the number of local
minima for various architectures with the two contrasting algorithms:


Number of Local Minima: Cubic Function

Hidden Nodes   # tests   "Standard" BPN   "Modified" BPN
     3           200          200                9
     5           200          113                1
     7           200            7                0
     9           200            1                0
    11           200            0                0

Although there are several local minima configurations for the network, the following
graph shows a typical output from a network that is in a local minimum. (In this case the max and
min on all nodes were set to 1 and 0, respectively.)


Plot[netoutput[netweights, hidmaxmin[[i]], outmaxmin[[i]], {x}],
{x, -3, 3}, PlotRange -> All, PlotLabel -> "Plot" (i/2)]


Local Minima for Network Learning Cubic Function

Similarly, there are a number of local minima configurations when the network is trying
to approximate the Bessel function. One of these configurations is:


Local Minima for Network Learning Bessel Function

The following table shows the improvement made in the number of local minima in
convergence to the Bessel function in various network architectures. The modified BPN
algorithm shows markedly better convergence.


Number of Local Minima: Bessel Function

Hidden Nodes   # tests   "Standard" BPN   "Modified" BPN
     3           100          100               27
     5           100          100                8
     7           100           93                7
     9           100           63                7
    11           100           42                1

Not only does the modified BPN algorithm decrease the number of local minima, it
simultaneously reduces the number of cycles necessary for convergence. This is described in the
next section.

5. Result: Effect on Speed of Convergence

I checked the speed of convergence by running the network up to 10,000 cycles, where
one cycle was a presentation of all the training patterns. The network was said to converge
when, for a cycle, the maximum absolute error over all training patterns was less than 0.5. The
plots below for 100 runs show the number of cycles it took for the network to converge by this
criterion. Note that the network was in a local minimum for most (but not all) of the cases where
it had not converged in 10,000 cycles.
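The convergence criterion above can be sketched as follows. This is a minimal illustration, assuming a hypothetical callback `train_one_cycle` that runs one full presentation of the training patterns and returns the current network outputs; the function names are not from the paper.

```python
def has_converged(targets, outputs, tol=0.5):
    """True when the maximum absolute error over all training patterns is below tol."""
    return max(abs(t - o) for t, o in zip(targets, outputs)) < tol

def cycles_to_converge(train_one_cycle, targets, max_cycles=10_000, tol=0.5):
    """Count training cycles until the criterion is met; None if the cutoff is reached."""
    for cycle in range(1, max_cycles + 1):
        outputs = train_one_cycle()  # one presentation of all training patterns
        if has_converged(targets, outputs, tol):
            return cycle
    return None
```

Returning `None` at the cutoff keeps unconverged runs distinguishable from runs that converge exactly at the last allowed cycle.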


Test  runs  with  5  hidden  nodes;  Cubic  Function;  Standard  BPN 


As the nodes in the hidden layer increased, the modified BPN algorithm more
consistently gave convergence times in the 1900 to 2000 cycles range. With 11 hidden nodes,
the convergence statistics on the number of cycles it took to converge were:

Mean = 2034, Median = 1954, Standard deviation = 338.5

The standard BPN algorithm continued to improve with more hidden nodes, but even for 11
hidden nodes, the results were not that good. The statistics in this case are:

Mean = 6533, Median = 6546, Standard deviation = 970.5

The chart on the next page shows graphically the poor convergence results even in this case.

Test runs with 11 hidden nodes; Cubic Function; Standard BPN


The results on the Bessel function are not final but are very encouraging. The limit of
10,000 cycles allowed for convergence proved to be too stringent. The best convergence was
achieved with the 11-hidden node networks. Although it was clear that at least 58 of the runs
with the standard BPN algorithm were converging to a global minimum, after 10,000 cycles not
one of the 100 runs met the criterion of having an error of less than 0.5 on all training patterns.
The increased speed of the modified algorithm is seen by looking at the following graph showing
cycles at which that algorithm did achieve the convergence criterion. On the 100 runs, the
median time to converge was 9036 cycles. The mean time of 8580.7 is meaningless because of
the number of trials cut off after 10,000 cycles. Likewise, the standard deviation of 1646.2 does
not tell much; it would be somewhat larger if we upped the cutoff limit.

Test runs with 11 hidden nodes; Bessel Function; Modified BPN


6. Conclusions and Discussion

This easily implemented modification to the backpropagation algorithm shows much
promise in avoiding local minima and simultaneously speeding the rate of convergence during
neural network training. It is recognized that the mappings that were tested here are
relatively simple. Work is now in progress to verify these results with actual problem data. At
UAF we have implemented a neural network to aid lightning forecasting in Alaska and had
problems getting convergence and avoiding local minima. We are starting to test this algorithm
on a more extensive system which has 16 inputs, 30 to 60 hidden nodes and one output and is to
be trained on lightning data. Because the algorithm is based on the fundamental idea of avoiding
local minima by modifying the error surface instead of simply finding a path on the error
surface to a global minimum, we believe that these results are extensible to much larger systems.

Abstract

A  system  matrix  is  introduced  through  a  reformulation  of  the  backpropagation  training  algorithm.  The 
condition  number  of  this  matrix  is  a  good  indicator  of  training  convergence.  The  structure  of  the  system 
matrix  can  also  be  exploited  to  accelerate  convergence.  This  is  illustrated  via  the  addition  of  derivative  noise, 
the  use  of  a  cross  entropy  cost  function  and  the  addition  of  hidden  units. 

1  Introduction 

Backpropagation  of  errors  [12,  8]  is  the  most  popular  training  algorithm  for  feedforward  artificial  neural 
networks  [3].  Its  popularity  is  a  result  of  its  simplicity  and  easy  implementation  on  digital  computers. 
A  major  disadvantage  of  backpropagation  is  its  slow  convergence  [7].  Presently,  there  exists  a  myriad  of 
algorithms  which  increase  the  convergence  rate  of  backpropagation  [10,  1,  9,  11,  2].  However,  many  are  based 
on  ad  hoc  modifications  which  perform  well  under  simulation  of  specific  examples,  but  offer  little  in  the  way 
of  analysis.  To  design  algorithms  which  have  better  convergence  rates  it  is  important  to  understand  the 
factors  underlying  training  dynamics. 

This  paper  reformulates  the  backpropagation  training  algorithm  into  a  linear  algebraic  framework.  As 
a  result,  analysis  of  training  dynamics  is  simpler  and  linear  techniques  can  be  used  to  predict  algorithm 
performance. 

In section 2 the system matrix, A(w), is defined. It represents the core of this analysis. In section 3 the
use of the condition number of a matrix as a related notion of linear independence is discussed. The condition
number of the system matrix is used to quantify the training process. Simulation results in section 4 illustrate
the use of condition number as a convergence indicator.

In section 5, the matrix formulation is used to predict the performance of various algorithms. These
include the use of derivative noise [2] and the use of the cross entropy cost function [10, 11]. The addition of
hidden units is also analysed for convergence performance.

2  Construction  of  the  System  Matrix,  A(w) 

This  section  outlines  the  construction  of  the  system  matrix  from  the  basic  backpropagation  of  errors  [8] 
weight  update  formulae. 

Consider a multilayer perceptron with I inputs, H hidden units and one output unit. The training
set consists of P patterns and the training cost function is chosen to be the sum of squared error, i.e.
$E(w) = \frac{1}{2} \sum_{l=1}^{P} (t^l - o^l)^2$. $t^l$ is the target output for the input pattern $x^l$, and $o^l$ the corresponding network
output. The superscripts denote the pattern number. The vector w represents the weight vector of the
network. It is comprised of $\Omega$, the weights from the hidden units to the output unit, and $\Lambda$, the weights from
input to hidden nodes. Individual weights in $\Omega$ are denoted $\Omega_i$, $i = 1, \ldots, H+1$. Weights in $\Lambda$ are further
divided into $\Lambda_i$, $i = 1, \ldots, H$, the weight vectors from the input nodes to the $i$-th hidden unit. A particular
weight that connects hidden node i and input j is denoted $\Lambda_{i,j}$. The $\Omega_{H+1}$ and $\Lambda_{i,I+1}$ weights in w represent
the bias weights which have an activation of unity.

As prescribed by the backpropagation algorithm the weight vector w is updated according to the gradient
descent rule, i.e.

$$ \Delta w = -\eta \nabla_w E(w), \qquad (1) $$

where $\eta$ represents the step size or learning rate.

The weight update rules can be written explicitly as

$$ \Delta\Omega_i = \eta \sum_{l=1}^{P} (t^l - o^l) f'(h_o^l)\, h_i^l, \qquad \Delta\Lambda_{i,j} = \eta \sum_{l=1}^{P} (t^l - o^l) f'(h_o^l)\, \Omega_i f'(z_i^l)\, x_j^l. \qquad (2) $$

$z_i^l$ is the activation for hidden node i and $h_o^l$ is the activation for the output node. $h_i^l$ is the output of hidden
node i and $x_j^l$ is the $j$-th component of the input pattern. The terms $h_{H+1}^l$ and $x_{I+1}^l$ are unity, for the update
of the bias weights. In addition, $f(\cdot)$ is the nodal activation function and $f'(\cdot)$ its first derivative. Throughout
this discussion it is assumed that $f(\cdot)$ is at least once continuously differentiable, bounded and monotonically
increasing. The sigmoid function $f(x) = 1/(1 + e^{-x})$ is an example of such a function. This function is used
in all simulations presented.

The update equations can be interpreted as a set of linear equations in the term $(t^l - o^l)$. Let $e^l(w) = t^l - o^l$
and $e(w) = [e^1(w) \ldots e^P(w)]^T$. e(w) is the error vector. The weight update equations can now be rewritten
as

$$ \Delta\Omega = \eta A_\Omega(w) e(w) \qquad (3) $$

$$ \text{where } [A_\Omega(w)]_{i,l} = f'(h_o^l)\, h_i^l \qquad (4) $$

$$ \Delta\Lambda_i = \eta A_{\Lambda_i}(w) e(w) \qquad (5) $$

$$ \text{where } [A_{\Lambda_i}(w)]_{j,l} = f'(z_i^l) f'(h_o^l)\, \Omega_i\, x_j^l \qquad (6) $$

$[A]_{q,r} = [a_{q,r}]$ is used to denote a matrix A, with element $a_{q,r}$ at row q and column r. These equations can
be further accumulated into one matrix equation for the simultaneous update of all the weights in w. When
equations 4 and 6 are combined, the result is

$$ \Delta w = \eta A(w) e(w), \quad \text{where } A(w) = \begin{bmatrix} A_\Omega(w) \\ A_{\Lambda_1}(w) \\ \vdots \\ A_{\Lambda_H}(w) \end{bmatrix} \qquad (7) $$


A(w)  is  the  system  matrix. 

We  assume  that  P  <  (H  + 1)  +  H(I  + 1),  that  is,  the  number  of  training  patterns  is  less  than  the  number 
of  weights  and  biases  in  the  network.  This  assumption  holds  for  many  applications  in  which  artificial  neural 
networks  have  been  used. 
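The construction of the system matrix can be sketched numerically. The following NumPy code is an illustration under stated assumptions, not the authors' implementation: it assumes a sigmoid network with one hidden layer and one sigmoid output, the sum of squared error cost $\frac{1}{2}\sum_l (t^l - o^l)^2$, and a row ordering of output weights first, then the input weights of each hidden unit in turn. The function names `forward`, `system_matrix` and `sse` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(Omega, Lam, X):
    """Forward pass. Omega: (H+1,), Lam: (H, I+1), X: (P, I).
    Returns bias-augmented inputs, hidden activations, bias-augmented hidden
    outputs, output activation and network output."""
    P = X.shape[0]
    Xb = np.hstack([X, np.ones((P, 1))])            # last column is the bias input
    Z = Xb @ Lam.T                                  # hidden activations, (P, H)
    Hb = np.hstack([sigmoid(Z), np.ones((P, 1))])   # hidden outputs plus bias, (P, H+1)
    h_o = Hb @ Omega                                # output activation, (P,)
    return Xb, Z, Hb, h_o, sigmoid(h_o)

def system_matrix(Omega, Lam, X):
    """Stack the output-weight block and one block per hidden unit; columns index patterns."""
    Xb, Z, Hb, h_o, o = forward(Omega, Lam, X)
    d_out = o * (1.0 - o)                           # f'(output activation) per pattern
    hid = Hb[:, :-1]
    d_hid = hid * (1.0 - hid)                       # f'(hidden activation), (P, H)
    blocks = [(Hb * d_out[:, None]).T]              # output-weight block, (H+1, P)
    for i in range(Lam.shape[0]):
        scale = d_hid[:, i] * d_out * Omega[i]      # factor shared by each column
        blocks.append((Xb * scale[:, None]).T)      # block for hidden unit i, (I+1, P)
    return np.vstack(blocks)

def sse(Omega, Lam, X, t):
    """Sum of squared error cost with the 1/2 factor."""
    return 0.5 * np.sum((t - forward(Omega, Lam, X)[-1]) ** 2)
```

For an Exclusive-Or network with two hidden units this gives a 9 x 4 matrix, and multiplying it by the error vector t - o reproduces the negative gradient of the cost.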


3  Condition  Number  of  A(w)  and  Network  Convergence 

This  section  discusses  the  use  of  the  condition  number  as  a  relative  measure  of  rank  in  a  matrix.  The  result 
is  an  interpretation  of  the  system  matrix  which  is  illustrated  through  simulation. 

The  condition  number  of  a  matrix  is  the  ratio  of  the  largest  and  smallest  singular  values  of  a  matrix  [6]. 
A large condition number corresponds to a matrix which has columns which are nearly linearly dependent.
This implies that the rank of the matrix is nearly deficient. The condition number provides relative information
about  the  linear  dependence  of  the  columns  of  a  matrix  rather  than  the  absolute  information  given  by  the 
rank  (i.e.  either  dependent  or  independent). 
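As a small illustration of this definition, the condition number can be computed from the singular values with NumPy. The helper name `condition_number` is hypothetical; `numpy.linalg.svd` and `numpy.linalg.cond` are standard NumPy functions.

```python
import numpy as np

def condition_number(A):
    """Ratio of the largest to the smallest singular value of A."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values in descending order
    return s[0] / s[-1]

# Two nearly identical columns give a very large condition number.
nearly_dependent = np.array([[1.0, 1.0],
                             [1.0, 1.0 + 1e-6],
                             [0.0, 0.0]])
well_conditioned = np.eye(3)
```

For the 2-norm this agrees with `np.linalg.cond(A)`, so the helper is shown only to make the singular-value ratio explicit.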

Using this linear dependence approach equation 7 can be rewritten as

$$ \Delta w = \eta \sum_{l=1}^{P} a^l(w) e^l(w) \qquad (8) $$

where $a^l(w)$ is the $l$-th column of A(w) and $e^l(w)$ is the scalar error for the $l$-th pattern. The term $a^l(w) e^l(w)$
represents the gradient of the cost function for the $l$-th pattern. Thus equation 8 is the vector sum of P
gradient vectors.

Figure 1: Condition Number and Network Error Trajectories

The linear dependence of the columns of A(w) (i.e. the rank) determines the number of directions that
$\Delta w$ can take. If the rank of A(w) is deficient, then the directions of $\Delta w$ are restricted; in fact there exists a
subspace of error vectors, $e(w) \neq 0$, which will result in $\Delta w = 0$. The algorithm has then become stuck in a local
minimum. If, however, the rank of A(w) is full, then $\Delta w \neq 0$ for $e(w) \neq 0$ and the training error continues
to decrease. Note that if $e(w) = 0$ the global solution has been reached. For this analysis strict gradient
descent is always guaranteed, thus there is no possibility of limit cycle behaviour.
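The rank argument can be checked on a toy example. The sketch below is illustrative only: the matrices are made up for the demonstration and are not system matrices of a trained network, and the helper name `weight_update` is hypothetical. It forms the weight change as the step size times the sum of the columns of A weighted by the per-pattern errors.

```python
import numpy as np

def weight_update(A, e, eta):
    """Weight change as eta times the error-weighted sum of the columns of A."""
    return eta * sum(A[:, l] * e[l] for l in range(A.shape[1]))

# Rank-deficient case: identical columns, so the nonzero error e = [1, -1] cancels.
A_deficient = np.array([[1.0, 1.0],
                        [2.0, 2.0],
                        [3.0, 3.0]])
# Full column rank: every nonzero error vector produces a nonzero weight change.
A_full = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
e = np.array([1.0, -1.0])

stuck = weight_update(A_deficient, e, eta=0.5)   # zero vector although e is nonzero
moving = weight_update(A_full, e, eta=0.5)       # nonzero update
```

The column sum is written out explicitly to mirror the pattern-by-pattern view; it is numerically the same as `eta * A @ e`.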

Because of the continuous nature of the activation functions, it is unlikely that the system matrix will
lose rank. However, in the neighbourhood of a local minimum, the columns of A(w) will become nearly linearly
dependent. Thus it is appropriate to use the condition number as a measure of the linear dependence of the
columns of A(w).

4  Simulation  Examples 

To study the linear dependence of the columns of A(w), the condition number of the system matrix and
the training error of an Exclusive-Or (one layer of two hidden units) network were tracked (see figure 1).
The graphs show that the condition number and error fall together. This is a result of the columns of A(w)
becoming more linearly independent. In trials where convergence was not attained the condition numbers
remained higher than those of the converged trials. Extended simulation runs revealed that high condition
numbers may be associated with local minima. Note that one of the trials showed an increase in condition
number with continued training and thus may be stuck in a local minimum.

Final solution optimality was also characterised using the condition number. A set of trials was run
for a fixed duration. The condition number of the system matrix was then calculated for each trial. The
convergence criterion was a sum of squared error less than 0.05. Five hundred Exclusive-Or networks (as
above) were simulated. Figure 2 shows the results of the simulation. Of the 500 trials, 299 converged and
201 did not. The bars represent the mean condition number of the final system matrix. Note that for the
converged trials the final condition number of the system matrix is much smaller. This indicates a strong
correlation between network convergence to a globally optimal solution and a low system matrix condition
number.


Figure 2: Mean condition numbers of converged and unconverged trials after 10000 epochs


5  Enhancing  convergence  by  conditioning  A(w) 

Simulation  results  have  indicated  that  there  is  a  strong  correlation  between  a  low  condition  number  and  final 
solution  optimality.  In  this  section  algorithms  that  use  this  relationship  are  discussed. 

5.1 Improving the condition number of A(w)

Before  examining  the  algorithms,  it  is  useful  to  outline  some  of  the  properties  of  the  system  matrix  that  they 
exploit. 

In examining the rank of the system matrix only the columns were considered for linear independence.
The rank calculation can also be viewed as an examination of linearly independent rows of the system
matrix. In particular, many of the rows of the submatrix $A_\Lambda(w)$ are linearly dependent. To see this note
equation 6. The $l$-th column of $A_{\Lambda_i}(w)$ shares the term $f'(z_i^l) f'(h_o^l)\, \Omega_i$. Since $f'(\cdot) > 0$ for any finite
argument, this term can be eliminated from each column without changing the rank of $A_{\Lambda_i}(w)$, provided $\Omega_i \neq 0$. Thus
$\mathcal{R}(A_{\Lambda_i}(w)) = \mathcal{R}([x_j^l])$, where $\mathcal{R}(\cdot)$ is the rank operator. Note that the simplified matrix is independent of i,
implying $\mathcal{R}(A_{\Lambda_i}(w)) = \mathcal{R}(A_{\Lambda_k}(w))$ for any hidden units i and k. Thus, many of the rows of $A_\Lambda(w)$ do not contribute to improving
the condition number of A(w). This is an inherent redundancy in backpropagation that can be exploited to
increase convergence performance.

This simplification can also be performed for $A_\Omega(w)$. However, the resulting rank is the same as that
of $A_\Omega(w)$. This is because the terms of the simplified matrix are functions of the network weights which
change over the training period. Thus no rows can be eliminated due to linear dependency.

5.2  Derivative  Noise 

One ad hoc method of achieving the modulation of the elements of $A_\Lambda(w)$ is to randomize them by adding
noise. If the terms in the matrix are disturbed using noise, more rows can potentially contribute to the
condition number. At the same time, however, the guarantee of strict gradient descent is lost. As a result
this method can be very sensitive to the magnitude of noise used. In spite of this, the noise method has
been shown to be effective by Fahlman [2], who used the noise to prevent saturation of the derivative of
the activation function. When a small amount of noise was added to the derivatives, the networks converged
faster and were less likely to get stuck in local minima. In other words, by keeping more rows linearly
independent, the algorithm could proceed without being impeded by local minima.

5.3  Cross  Entropy  versus  Sum  of  Squared  Error 

Another  method  of  increasing  convergence  performance  is  to  increase  the  magnitude  of  the  weight  update 
vector.  Thus  a  larger  step  is  taken  down  the  error  surface  during  each  weight  update,  resulting  in  faster 
convergence. 

The system matrix, A(w), can be simplified by eliminating the term $f'(h_o^l)$ from each column. This is in
direct analogy with the simplification of $\mathcal{R}(A_{\Lambda_i}(w))$. The sigmoidal activation function, $f(x) = 1/(1 + e^{-x})$,
has the property that $0 < f'(x) < 1$. That is, $f'(h_o^l)$ has the effect of attenuating the matrix elements and
thus the magnitude of the weight update vector. The removal of this term should accelerate convergence.

Figure 3: Effect of number of hidden units on network convergence

This approach has been used in training algorithms that have a cross entropy [10] type cost function. The
cross entropy cost function has the form

$$ E(w) = \sum_{l=1}^{P} \left[ t^l \ln\frac{t^l}{o^l} + (1 - t^l) \ln\frac{1 - t^l}{1 - o^l} \right] \qquad (9) $$

for a single output network.

When weight update equations are calculated for this cost function, the result is a system matrix without
the $f'(h_o^l)$ term. These algorithms yield faster convergence when compared with the sum of squared error
cost function as shown in [10] and [11].
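The disappearance of the derivative term can be checked numerically for a single sigmoid output. The sketch below is an illustration under an assumption: it uses the standard binary cross entropy $-[t \ln o + (1 - t)\ln(1 - o)]$, which may differ from the paper's exact form by terms that do not depend on the weights. The function names are hypothetical.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def sse(t, h):
    """Squared error for one pattern, as a function of the output activation h."""
    return 0.5 * (t - sigmoid(h)) ** 2

def cross_entropy(t, h):
    """Standard binary cross entropy for one pattern (assumed form)."""
    o = sigmoid(h)
    return -(t * math.log(o) + (1.0 - t) * math.log(1.0 - o))

def numeric_derivative(fn, h, eps=1e-6):
    """Central-difference derivative of fn at h."""
    return (fn(h + eps) - fn(h - eps)) / (2.0 * eps)

t, h = 1.0, -3.0                       # a badly wrong, saturated output
o = sigmoid(h)
d_sse = numeric_derivative(lambda v: sse(t, v), h)            # about -(t - o) * o * (1 - o)
d_ce = numeric_derivative(lambda v: cross_entropy(t, v), h)   # about -(t - o)
```

In this saturated case the squared-error derivative is attenuated by the factor o(1 - o), while the cross entropy derivative keeps the full error t - o.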

5.4  Increasing  number  of  hidden  units 

The condition number of A(w) can also be improved by increasing the number of rows. In particular, adding
rows to $A_\Omega(w)$ is equivalent to adding more hidden units to the network. It follows that more hidden
nodes should increase the possibility of obtaining a better conditioned system matrix. A similar redundancy
approach is taken by Izui and Pentland [5]. They show that when redundant nodes (input and hidden) are
added to the network they speed up convergence. In figure 3 the number of converged networks is plotted for
the Exclusive-Or example with varying hidden layer sizes. Note the increase in the percentage of converged
networks as the number of hidden units is increased. In fact, with four or more hidden units all the trials
converged. Figure 3 also shows the number of epochs required to achieve the convergence criterion. As the
number of hidden units increases the number of epochs needed to achieve convergence decreases.

6 Conclusions


In this paper, a linear algebraic formulation has been described that is useful in predicting the convergence
performance of algorithms that are based on backpropagation of errors.

Simulations showed the correlation between the training error and system matrix condition number
trajectories.

The enhanced performance of algorithms with derivative noise [2] and cross entropy cost functions [10, 11]
was predicted using the matrix formulation. In addition, improved performance due to additional hidden
units was predicted by the formulation and illustrated through simulation. The approach is similar to that
found in [5].

Abstract: A new rapid and efficient learning algorithm (OHLO Algorithm) is proposed
in this paper. In the process of learning, the hidden layers' outputs are optimized.
Experiments on the XOR problem, the 3 bit parity problem and the circle decision area forming
problem show that the convergence stability and the training speed of the proposed method
are better than those of standard BP.

I.  INTRODUCTION 

In recent years, MLNN with the standard BP algorithm has been applied successfully to many
scientific fields. However, applications are limited because of the slowness
of the learning process. Research on methods of speeding up supervised learning for MLNN
is very important in both practice and theory. Many techniques have been studied to
speed up back propagation, such as dynamic adaptation of learning parameters [2,5],
optimizing weights based on the Kalman filter principle [4] and the conjugate gradient
algorithm [1,3]. The methods based on back propagation principles are reported to
converge several to ten times faster than the
standard BP.

Here we address a new learning algorithm in which learning for MLNN is also
regarded as parameter estimation for a multi-input, multi-output nonlinear system.
The new method, however, uses different principles for learning from those of
standard BP. Layer-by-layer optimization, which optimizes both the weights and the inputs of
a given layer, is adopted; the optimized inputs are taken as the desired outputs
of the previous layer. We name it the OHLO (Optimizing Hidden Layers' Outputs)
algorithm. This algorithm requires only simple computations and accelerates the learning
of MLNN considerably.

The new algorithm is introduced in detail in Section II; simulation results on the
XOR problem, the 3-bit parity problem and the circle decision area forming problem are
shown in Section III; in Section IV, some important conclusions are given.

II.  LAYER BY LAYER OPTIMIZING HIDDEN LAYERS' OUTPUTS LEARNING ALGORITHM

Consider any layer of the MLNN, say the mth layer, with L inputs
$y_{p1}, y_{p2}, \ldots, y_{pL}$, denoted as:

$Y_p = [y_{p1}, y_{p2}, \ldots, y_{pL}]^T$  (1)

where p refers to the pth pattern of the training set, p = 1, 2, 3, ..., P. Usually, $y_{pL} = -1$.
The outputs of the mth layer, $O_{p1}, O_{p2}, \ldots, O_{pN}$, are denoted as:

$O_p = [O_{p1}, O_{p2}, \ldots, O_{pN}]^T$  (2)

The mapping of input vector $Y_p$ to output vector $O_p$ implemented by the mth layer
can be expressed as:

$O_p = f(W Y_p)$  (3)

where W is the weight matrix of the mth layer neurons, an N x L matrix, and
f(·) is the nonlinear activation function of a neuron, typically the sigmoid function:

$f(x) = \dfrac{1}{1 + e^{-x}}$  (4)

Assume the desired responses of the mth layer for the pth training pattern to be
$d_{p1}, d_{p2}, \ldots, d_{pN}$, denoted as:

$D_p = [d_{p1}, d_{p2}, \ldots, d_{pN}]^T$  (5)

The squared error of the mth layer outputs for the pth training pattern is defined as:

$E_p = \frac{1}{2} \sum_{i=1}^{N} (d_{pi} - O_{pi})^2$  (6)

The global error function E for all training patterns is:

$E = \sum_{p=1}^{P} E_p = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N} (d_{pi} - O_{pi})^2$  (7)

The sub-algorithms and formulae used in this method are introduced as follows:

A.  Weights  Optimizing 

The function of this sub-algorithm is to minimize the output squared errors of a
certain layer, such as the mth layer, by optimizing this layer's weights while the inputs
are fixed. Steepest descent combined with a line-search technique is adopted here.
The weight adjustment formula is

$\Delta W = -\mu \nabla E_W$  (8)

where $\mu > 0$, $\mu \in R$, and

$\nabla E_W = \left[ \dfrac{\partial E}{\partial w_{il}} \right]_{N \times L}$, with $\dfrac{\partial E}{\partial w_{il}} = \sum_{p=1}^{P} \dfrac{\partial E_p}{\partial w_{il}}$  (9)

i = 1, 2, ..., N; l = 1, 2, ..., L.

$\mu$ in (8) is known as the learning rate. It is a tuned parameter. By selecting a proper $\mu$, we
can reach a local minimum of the error function along the gradient direction. A line-search
technique is used to solve this problem.
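
The weights-optimizing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the logistic activation, the squared error written as E = 1/2 * sum (d - O)^2, and a crude line search over a fixed grid of candidate learning rates (the paper does not say which line-search method was used). All function names and the candidate grid are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_error(W, Y, D):
    """Global squared error E = 1/2 * sum over patterns and outputs."""
    O = sigmoid(Y @ W.T)               # Y: (P, L), W: (N, L) -> O: (P, N)
    return 0.5 * np.sum((D - O) ** 2)

def weight_gradient(W, Y, D):
    """dE/dW for fixed inputs Y (one row per training pattern)."""
    O = sigmoid(Y @ W.T)
    delta = -(D - O) * O * (1.0 - O)   # dE/dnet, shape (P, N)
    return delta.T @ Y                 # shape (N, L)

def optimize_weights(W, Y, D,
                     candidates=(0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0)):
    """One steepest-descent step; mu is chosen by a grid line search
    (an assumption made for this sketch)."""
    g = weight_gradient(W, Y, D)
    errors = [layer_error(W - mu * g, Y, D) for mu in candidates]
    best = int(np.argmin(errors))
    return W - candidates[best] * g, candidates[best]
```

Because the grid contains mu = 0, the step never increases the layer error; a real implementation would likely use a proper one-dimensional minimizer instead.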

B.  Inputs  Optimizing 


The function of this sub-algorithm is to minimize the output squared errors of a
certain layer by optimizing this layer's inputs while the weights are fixed. We then take
the optimized inputs of the mth layer as the desired outputs of the (m-1)th layer.

Because the outputs of neurons take values in the range (0,1), steepest descent
and constrained line-search techniques are adopted here.

By the definition of the gradient, we get

$\nabla E_{Y_p} = \left[ \dfrac{\partial E}{\partial y_{p1}}, \dfrac{\partial E}{\partial y_{p2}}, \ldots, \dfrac{\partial E}{\partial y_{p,L-1}} \right]^T$  (10)

where

$\dfrac{\partial E}{\partial y_{pl}} = \sum_{i=1}^{N} \dfrac{\partial E_p}{\partial O_{pi}} \dfrac{\partial O_{pi}}{\partial y_{pl}}$,  p = 1, 2, ..., P; l = 1, 2, ..., L - 1.  (11)

According to the steepest descent principle, the inputs adjustment formula is

$\Delta Y_p = -\eta \nabla E_{Y_p}$  (12)

where $\eta > 0$, $\eta \in R$.

$\eta$ in (12) is called the inputs optimizing rate.

The line-search technique is used to seek the optimal $\eta$. Notice that the objective
function in this line-search procedure is different from that in the weights optimizing procedure.
A constrained objective function, with a constraint term added to Eq. (7), should be used
here. We select the added constraint term $E_c$ as

£c=  t  ('3) 

l-l 
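
The inputs-optimizing step can be sketched in the same way. The constraint term E_c is not legible in this copy, so the sketch below swaps in a different, simpler device: candidate inputs are clipped into (0,1) before the error is evaluated. That substitution, the candidate grid, and all function names are assumptions made for illustration; the logistic activation and the fixed bias input in the last column follow the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_error(W, Y, D):
    O = sigmoid(Y @ W.T)
    return 0.5 * np.sum((D - O) ** 2)

def input_gradient(W, Y, D):
    """dE/dY for fixed weights; the bias column (last) is not adjustable."""
    O = sigmoid(Y @ W.T)
    delta = -(D - O) * O * (1.0 - O)   # shape (P, N)
    g = delta @ W                      # shape (P, L)
    g[:, -1] = 0.0
    return g

def optimize_inputs(W, Y, D, candidates=(0.1, 0.3, 1.0, 3.0, 10.0), eps=1e-6):
    """Steepest-descent step on the inputs; clipping to (0, 1) stands in
    for the paper's constraint term (an assumption of this sketch)."""
    g = input_gradient(W, Y, D)
    best_Y, best_E = Y, layer_error(W, Y, D)
    for eta in candidates:
        trial = Y - eta * g
        trial[:, :-1] = np.clip(trial[:, :-1], eps, 1.0 - eps)
        E = layer_error(W, trial, D)
        if E < best_E:
            best_Y, best_E = trial, E
    return best_Y
```

The returned inputs would then serve as the desired outputs of the previous layer.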


C.  Layer  By  Layer  Optimizing  For  MLNN 
The first step of the training process of this algorithm is to optimize the weights
of the Mth layer (the output layer). After this step, the inputs of this layer are
optimized using the sub-algorithm discussed in section B with the newly obtained weights, and
the optimized input vectors of the Mth layer are taken as the desired output vectors $D_p^{(M-1)}$ of the
(M-1)th layer. The same optimizations are carried out for the (M-1)th layer, the (M-2)th layer, ...,
until the first layer is reached. Because the inputs of the first layer are the training pattern
vectors, only weights optimizing needs to be performed there. After one training
cycle is finished, if the output error of the network does not satisfy the requirement,
a new training cycle is started from the Mth layer and iterated layer by layer; learning
stops when the network has converged.

The learning algorithm can be summarized as follows:
step 1: Choose $\mu > 0$, $\eta > 0$ and Emax;
        Initialize the weights at random values;
        Set m = M; set the initial iteration number q = 0;
step 2: The training step starts here.
        The training patterns are presented and the layers' outputs are computed.
step 3: Optimize the weights of layer m using the method discussed in section A, and
        take the optimized weights as the new weights of layer m.
step 4: If m = 1, go to step 6; otherwise, optimize the inputs of layer m using the
        method introduced in section B, and take them as the desired responses of layer m-1.
step 5: Set m = m-1 and go to step 3;
step 6: Compute the total error E.
step 7: The training cycle is completed; set q = q+1. If E > Emax, set m = M and
        initiate a new training cycle by going to step 2; if E < Emax, stop training and output
        the weights $W^{(m)}$ (m = 1, 2, ..., M), the iteration number q and the total error E.

It is very interesting to point out that, owing to the independence of the weights and inputs
optimization in every layer, the iterations can proceed from the input
layer to the output layer, or vice versa, except for the first iteration.

III.  EXPERIMENT RESULTS

The XOR problem, the 3-bit parity problem and the circle decision area forming problem
are used in our experiments to compare the performance of the proposed
algorithm with that of the traditional BP algorithm. Tables I to III give the results of our
tests.

Circle  Decision  Area  Forming  (CDAF)  Problem 

The CDAF problem used in our tests may be defined as follows: if the input pattern
X falls inside a certain circle of radius r = 1, the desired output is 1; otherwise,
when the input is outside the concentric circle with a slightly larger radius, say
1.1, the desired output is 0. Let the training patterns be the points on the two circles
at every 45 degrees. The 16 sampled training patterns and their desired responses can be
expressed as follows:

(x1, x2, d) = { (1 * cos(3.14/4 * i), 1 * sin(3.14/4 * i), 1);
                (1.1 * cos(3.14/4 * i), 1.1 * sin(3.14/4 * i), 0);
                i = 0, 1, 2, ..., 7 }

Three kinds of MLNN architecture have also been used in this experiment.
Simulation results are listed in Table III.
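
As a small sketch, the 16 training patterns defined above can be generated as follows. The function name and its parameters are hypothetical, and the sketch uses math.pi where the text writes 3.14.

```python
import math

def cdaf_patterns(r_in=1.0, r_out=1.1, step=math.pi / 4):
    """Return (x1, x2, d) triples: 8 points on each of the two circles."""
    patterns = []
    for i in range(8):
        angle = step * i
        # point on the inner circle: desired output 1
        patterns.append((r_in * math.cos(angle), r_in * math.sin(angle), 1))
        # point on the outer circle: desired output 0
        patterns.append((r_out * math.cos(angle), r_out * math.sin(angle), 0))
    return patterns
```

The two classes are separated by a radial gap of only 0.1, which is presumably what makes the decision region demanding to learn.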

Table I.  Comparison of experimental results on the XOR problem

Networks       Structure        Algorithm   Iterations   MSE Error
Two layers     2->2->1          OHLO        60           0
Two layers     2->2->1          BP          16140        10^-?
Two layers     2->5->1          OHLO        46           0
Two layers     2->5->1          BP          12120        10^-?
Three layers   2->5->3->1       OHLO        65           0
Three layers   2->5->3->1       BP          12399        ?
Four layers    2->5->4->3->1    OHLO        44           ?
Four layers    2->5->4->3->1    BP          15422        ?

Table II.  Comparison of experimental results on the 3-bit parity problem

Networks       Structure        Algorithm   Iterations   MSE Error
Two layers     ?                OHLO        78           0
Two layers     ?                BP          12566        10^-?
Two layers     3->8->1          OHLO        71           0
Two layers     3->8->1          BP          6921         10^-?
Three layers   3->5->3->1       OHLO        104          0
Three layers   3->5->3->1       BP          4910         10^-?
Four layers    3->5->4->3->1    OHLO        66           0
Four layers    3->5->4->3->1    BP          53000        10^-?

Table III.  Comparison of experimental results on the CDAF problem

Networks       Structure        Algorithm   Iterations   MSE Error
Two layers     2->3->1          OHLO        2022         0
Two layers     2->3->1          BP          145300       10^-?
Two layers     2->10->1         OHLO        1402         0
Two layers     2->10->1         BP          30156        10^-?
Three layers   2->5->3->1       OHLO        1598         10^-?
Three layers   2->5->3->1       BP          124520       10^-?
Four layers    2->5->4->3->1    OHLO        541          10^-?
Four layers    2->5->4->3->1    BP          67000        10^-?

The results show that the convergence of the BP algorithm is more sensitive to the initial
weights than that of the OHLO algorithm. Furthermore, with OHLO the
learning process converges with higher speed and stability. In particular, once the squared
error falls below 10^-? in OHLO, learning converges very quickly. In the
learning process of the BP algorithm, by contrast, once the squared error has decreased to a certain degree,
convergence becomes slower and slower, and the error hardly ever reaches
zero.

Although the time required for one iteration by OHLO is about 6 times that
required by BP, the total number of iterations for convergence of OHLO is 2-3 orders of
magnitude less than that of BP with the same MLNN architecture and initial weights.
So OHLO can yield an acceleration of about 1-2 orders of magnitude compared to BP
on the three problems tested above.
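
The net acceleration can be checked with a short calculation. The sketch below uses the iteration counts reported in Table I for the 2->2->1 XOR network (60 for OHLO, 16140 for BP) and the per-iteration cost factor of 6 quoted above; the function name is hypothetical.

```python
import math

def net_speedup(bp_iterations, ohlo_iterations, cost_ratio=6.0):
    """Wall-clock speedup of OHLO over BP, assuming one OHLO iteration
    costs cost_ratio times as much as one BP iteration."""
    return bp_iterations / (ohlo_iterations * cost_ratio)

speedup = net_speedup(16140, 60)
orders_of_magnitude = math.log10(speedup)
print(round(speedup, 1), round(orders_of_magnitude, 2))
```

For this case the speedup falls between one and two orders of magnitude, in line with the claim in the text.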

We can also conclude from the experiments that the local minimum problem exists
in OHLO as it does in BP, but the probability of ending in a local minimum is
lower.

IV.  CONCLUSIONS 

In this paper, we proposed a learning algorithm for training MLNN by Optimizing
the Hidden Layers' Outputs (OHLO), which differs from the back propagation
principle and is more effective than the standard BP algorithm. Simulation results show that
OHLO is an order of magnitude or more faster than standard BP and converges more
stably when tested on the XOR problem, the 3-bit parity problem and the circle decision area
forming problem.

Our experiments indicate that OHLO is suitable for MLNN with any number of layers.
Optimizing layer by layer independently makes the new method simpler to realize with
parallel processing.

Another advantage of OHLO is that it is fully automated: unlike BP, in which the values of
user-dependent parameters are often crucial for the success of the algorithm, OHLO includes
no critical user-dependent parameters. The new method is therefore easier to use in practical
applications.

Of course, the local minimum problem cannot be eliminated completely in OHLO. It
remains a problem to be solved.

Investigating the GA/Back Propagation Trade-Off

in the GANNet Algorithm

Abstract: Designing and training neural networks are very difficult tasks, requiring much trial and error,
and the standard layered network architectures do not map naturally into hardware. This paper investigates
a novel genetic algorithm which is used to design the topology and find the link weights for a layered,
feedforward neural network. The topologies are not limited either in the number of layers or in the number
of nodes per layer. A robust, global search is conducted by the genetic algorithm over both the link weight
and topology spaces, after which back propagation can be used to quickly find the desired link weights.
Thus, both the GA and back propagation can be used to their greatest advantage: the global search of the
GA can find the approximate area of a solution, and back propagation can then quickly find the local
optimum. The point at which the GA should be terminated and back propagation should be employed is
illustrated for two example problems.

Keywords: neural network, genetic algorithm, multilayer perceptron


Introduction 

Creating  an  appropriate  neural  network  for  a  given  application  is  a  multivariate 
optimization  problem  with  few  (if  any)  reliable  guidelines.  Typically,  a  guess  is  made  of 
the  required  topology,  often  using  some  rule  of  thumb,  and  training  is  attempted.  If 
training  fails,  it  may  be  that  the  network  is  too  large,  too  small,  or  the  initial  point  in  the 
weight  space  was  too  poor  to  allow  the  network  to  learn  the  task.  Therefore,  the  designer 
has gained little knowledge to guide the next attempt, which can result in a large amount
of trial and error before success is obtained.

One  multivariate  optimization  technique  which  has  recently  gained  a  great  deal  of 
attention  is  the  genetic  algorithm  (GA).  The  virtue  of  the  GA  is  that  the  relationship 
between  the  parameters  to  be  optimized  and  the  evaluation  function  need  not  be  known 
for  the  GA  to  be  successful.  This  is  very  fortunate,  since  the  relationship  of  neural 
network  parameters  and  the  network's  success  in  learning  the  task  at  hand  is  definitely 
unknown.  Another  virtue  of  the  GA  is  that  the  evaluation  function  can  be  crafted  to 
optimize  those  features  of  the  solution  which  are  most  important.  For  more  information 
on  GAs,  see  [Gol89]. 

In the GANNet algorithm, the "parameter string" is a neural network. Each allele
is a node of the neural network with its input links. This includes the transfer function
type (which is either sigmoid or Gaussian), transfer function slope, scale, and offset, the
input link weights, and the index of the node of origin on the previous layer for each link.
Thus, each allele is a feature detector (since that node will respond to some set of features
in the input vectors) and therefore the search is conducted in "feature detector space",
rather than link weight space or topology space. In short, the algorithm searches for the
proper set of feature detectors to solve the problem at hand.


Although the algorithm finds solution networks a large percentage of the time, the
time required to reach a solution is not bounded, and as with all GAs, no convergence
theorem exists. Therefore, it may be advantageous to terminate the GA before a
satisfactory solution network has been found, and use back propagation (or whatever
local algorithm is preferred) on the best network to find the solution more quickly. For
this purpose, two example problems were chosen: the well-known exclusive-or problem,
and a synthetic binary problem. A synthetic problem was devised which would be simple
enough to allow the results to be easily understood, yet pose enough challenge to rise
above the trivial. For this purpose, a binary problem with eight inputs and one output
was chosen, with the output being the following binary function of the inputs:

O = I7 & ((I3 & I5) | (I4 & I6))

I1, ..., I8 are the individual inputs, "&" is the logical AND function, and "|" is the logical
(inclusive) OR function. Inputs I1, I2, and I8 are not used to compute the output. All
trials used whole population replacement, two point crossover, the "mutate nodes"
mutation operator [Whit93], and a distributed GA (as in [WS90]) with 20 subpopulations
of ten individuals each.
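
The synthetic target function can be written out and enumerated directly. The sketch below assumes the function reads O = I7 & ((I3 & I5) | (I4 & I6)) with inputs numbered I1 to I8, which is the reading consistent with inputs I1, I2 and I8 being unused; the names are hypothetical.

```python
from itertools import product

def synthetic_output(bits):
    """bits: tuple of eight 0/1 values for inputs I1..I8."""
    i1, i2, i3, i4, i5, i6, i7, i8 = bits
    # I1, I2 and I8 are deliberately ignored: they are distractor inputs
    return i7 & ((i3 & i5) | (i4 & i6))

# Full truth table over all 2**8 input patterns
truth_table = [(bits, synthetic_output(bits)) for bits in product((0, 1), repeat=8)]
positives = sum(out for _, out in truth_table)
```

Enumerating the table shows how unbalanced the two output classes are, which matters when judging an error threshold such as the 2% used in the trials.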

Results 

First,  the  difference  between  the  desired  error  (2%)  and  the  best  initial  error  was 
calculated.  The  difference  is  divided  by  10  to  produce  "checkpoints"  at  every  10%  of 
error  reduction,  from  merely  using  the  best  initial  network  (0%)  to  allowing  the  GA  to 
reduce  the  error  to  the  desired  2%  threshold  (which  is  100%  of  the  error  to  be 
eliminated).  At  each  checkpoint,  the  best  performing  network  is  trained  using  back 
propagation.  If  the  genetic  operators  are  performing  as  desired,  the  number  of  hidden 
nodes  should  show  a  decrease  with  the  amount  of  the  original  error  the  GA  is  allowed  to 
eliminate,  since  the  topology  search  of  the  GA  will  have  had  more  time  to  find  better 
topologies.  Table  1  presents  the  averages  recorded  over  ten  trials,  and  Figure  1  shows  a 
graphical  representation  (which  is  a  bit  easier  to  grasp). 
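
The checkpoint schedule described above amounts to a short calculation. In this sketch the best initial error of 0.52 is a made-up value for illustration, and the function name is hypothetical; the desired error of 2% and the ten equal steps come from the text.

```python
def checkpoint_errors(best_initial_error, desired_error=0.02, steps=10):
    """Error level at which the GA hands over to back propagation,
    for 0%, 10%, ..., 100% of the error gap eliminated."""
    gap = best_initial_error - desired_error
    return [best_initial_error - gap * k / steps for k in range(steps + 1)]

# Hypothetical example: best initial network has 52% error
for percent, level in zip(range(0, 101, 10), checkpoint_errors(0.52)):
    print(percent, round(level, 3))
```

At the 0% checkpoint back propagation starts from the best initial network, and at 100% the GA alone has already reached the target.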


Error   Success    Number of     Number of   Number of   Number of
(%)     Rate (%)   Generations   Epochs      Layers      Hidden Nodes
0       90         0             257         2.4         4.6
10      80         1.6           233         2.3         2.8
20      -          -             -           -           -     *
30      50         4             519         2           1.5   *
40      100        4.5           4.5         1.5         0.5   *
50      100        15            11          1.8         1     *
60      100        17.3          4.7         1.7         1     *
70      100        5             4.6         1.8         1
80      100        13            3.7         1.5         0.8
90      100        21.6          2.1         1.7         0.9
100     100        39.5          0.3         1.7         0.9

Table 1

GA/Back Propagation Trade-Off for the Exclusive-Or Problem
An asterisk (*) indicates that there are fewer than five data points.

[Figure 1. Exclusive-Or GA/Backprop Trade-Off. Horizontal axis: Percent of Original Error Eliminated by the GA.]


These results indicate that the genetic topology search is performing as desired.
The graph shows a clear trade-off between the number of GA generations and the number
of back propagation epochs and network size. An optimal trade-off point seems to be
around the point at which the GA has eliminated 80% of the original error. After this
point, the GA spends a lot of time settling in on the final solution. There also seems to be
an optimum at the 40% mark, but too few data points exist in this region to reach any
conclusion. The major network size reduction seems to be in reducing the first 40% of
the error, which makes some sense, since a "cheap" way to increase fitness is to decrease
network size (thereby decreasing the node penalty used in the evaluation (or fitness)
function). It so happens that this problem is simple enough to be solved by the smaller
networks.

The  virtue  of  allowing  Gaussian  nodes  in  the  networks  is  also  demonstrated  by 
these  results,  since  3  of  the  ten  trials  resulted  in  networks  with  no  hidden  layers  at  all 
(using  a  single  Gaussian  node  as  the  output),  and  five  of  the  other  trials  had  hidden  layers 
consisting  of  a  single  Gaussian  node.  Thus,  in  cases  where  a  subfunction  of  the  output  is 
of  a  Gaussian  form  naturally  (as  in  this  case),  the  ability  to  include  Gaussian  nodes  in  the 
networks  allows  far  more  compact  topologies  than  networks  with  only  sigmoid  nodes. 

Next, the synthetic problem was used, with the same procedure as before. Table 2
presents the test data, with Figure 2 showing the graphical representation. The initial
networks tended to be small, because all of the networks performed rather poorly, so the
discriminating feature was the size of the network (via the node cost in the fitness
function). Unfortunately, these networks also tended to perform poorly, as evidenced by
the low convergence rates. Once the network performance became the discriminating
feature, the network size began to grow, and the convergence rate increased as well.
There seems to be a definite optimal crossover point at around 80% error reduction,
where the network size is still relatively small and the performance is good, as well.
After that, the GA is straining to find an optimum performing network, and increases the
size of the network to achieve it. This shows the tendency of the GA to find a
neighborhood of the solution relatively quickly, but to have great trouble in finding the
actual solution, as can be seen from the fact that the number of generations rapidly
increases as the GA converges on a solution. In Figures 1 and 2, the independent variable
is the percent of the original error the GA is allowed to eliminate before the switch to
back propagation is made.


Error   Success    Number of     Number of   Number of   Number of
(%)     Rate (%)   Generations   Epochs      Layers      Hidden Nodes
0       50         0             504         2           3
10      -          -             -           -           -
20      -          -             -           -           -
30      38         1             633         2           3
40      0          1.6           1000        2           2.6
50      60         2.3           456         2           3
60      50         4.5           561         2           3.2
70      70         7.6           353         2.3         6.7
80      100        12.9          51          2.3         6.4
90      90         17.9          172         2.4         7.2
100     100        40.9          5.5         2.7         8.0

Table 2

GA/Back Propagation Trade-Off for the Synthetic Problem

[Figure 2. Synthetic Problem GA/Backprop Trade-Off. Horizontal axis: Percent of Original Error Eliminated by GA (0 to 100). Recoverable curve labels: Number of Hidden Nodes, Backprop.]

Conclusion 

This paper has investigated a new genetic algorithm which develops neural
networks. The trade-off between the global genetic search and a local search algorithm
(in this case back propagation) was investigated, with the result that after 80% of the
original error has been eliminated, the rate of progress of the genetic algorithm decreases
rapidly, and a switch to the local algorithm yields faster solutions. In fact, on harder
problems, the GA will actually increase the size of the networks to reduce the last bit of
error, which is hardly a desirable effect.

Abstract

The restart concept is used in the conjugate gradient method to find the best error-decreasing
vectors when the method fails to reach a minimum. We applied this restart concept to the error
backpropagation with gains algorithm and obtained a good result. This paper reports the idea
and the experimental results. We apply the restart procedure to error backpropagation with
gains learning not only periodically but also when the error did not decrease during the
previous learning epoch. The restart procedure resets the gains and the bias, and the learning is
continued. Experiments using parity and encoder problems showed that the proposed
approach is about 10 times faster in learning time than the conventional error backpropagation
with momentum algorithm.

1.  Background 

1.1.  Summary of BPG

There has recently been growing interest in extensions of the conventional error backpropagation with
momentum (BP) algorithm[1]. Our interest here is mainly in the error backpropagation with gains (BPG)[2]
algorithm and the neural implementation of the conjugate gradient (CG) algorithms[3,4]. Conjugate gradient
algorithms are generally much faster and can result in an impressively low error level. But these algorithms are
also particularly vulnerable to being trapped in local minima. In particular, line search efficiency and the initial
weight set are critical to the success of conjugate gradient algorithms[4,5].

The BPG algorithm has several advantages over the BP algorithm: it is faster than BP in lowering the standard
deviation of the error over the entire training pattern set during the learning phase[6]. For some problems BPG is
much less likely to become trapped in local minima[7,6]. It was also shown that BPG is on average 2.2 times
faster on some parity and encoder problems[2].

However, BPG has a shortcoming too: its error-decreasing rate becomes slow in the later learning phase, which
is common to all gradient descent learning algorithms. Our effort was directed at
overcoming the learning slow-down phenomenon generally observed in gradient descent learning algorithms,
including BP and BPG.

1.2.  Restart procedure of CG method

The learning problem of neural networks can be formulated as an optimization problem in numerical analysis.
Among the techniques of numerical analysis which can be used to train neural networks, the conjugate gradient
method is relatively easy to implement and inexpensive in computational cost.

CG tries to directly reach a minimum on the error surface by using a quadratic approximation, saving search
effort otherwise required by gradient descent calculations. But due to the gap between the real shape of the error
surface and the approximation of the weights of the minimum, this method usually fails to directly reach a
minimum and hence needs more trials of approximation.

When it fails, this method uses a restart procedure to guess again the best error-decreasing direction vectors
from the last gradient descent direction. Then the regular learning operation is continued. We apply this restart
concept of the CG method to BPG by developing a restart procedure that is suitable for BPG.

2.  Restart  Procedure 

2.1.  When to restart and when to stop restarting

The restart concept in the BPG algorithm[2] is based on the idea of the restart in the CG method: the improvement of
the navigating directions on the error surface in weight space.

We use the BPG algorithm without momentum. A restart process is taken periodically, every 2N
epochs of learning, where N is the number of weights in the network. In addition, when the error did
not decrease during the last learning epoch, a restart process is also taken, even within the period
mentioned above. The choice of the length of the period is heuristic at this moment. We found after many
experiments that 2N is about the optimum, at least for the problems mentioned later.
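
The restart schedule can be sketched as a simple predicate. This is a hedged illustration only: the function and parameter names are hypothetical, and the sketch assumes the 2N-epoch period is counted from the most recent restart, which the text does not state explicitly.

```python
def should_restart(epoch, num_weights, prev_error, curr_error, last_restart_epoch=0):
    """Return True when a restart is due.

    A restart is triggered either periodically (every 2N epochs, N being
    the number of weights) or as soon as the error failed to decrease
    during the last epoch.
    """
    period = 2 * num_weights
    periodic = (epoch - last_restart_epoch) >= period
    stalled = curr_error >= prev_error
    return periodic or stalled
```

For a network with 9 weights the periodic trigger would fire 18 epochs after the previous restart, while a stalled epoch triggers a restart immediately.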

Although the restart process is taken periodically from the beginning of the learning as described above, it is not
always necessary to continue the process forever. More specifically, we can stop restarting as soon as the
learning reaches a classifiable state. A classifiable state is a learning stage of the network where every node
of the output layer can classify all input patterns into the correct pattern classes according to the signs of
their net inputs. For example, suppose that a node of the output layer is to classify the inputs into class A and
non-class A patterns, and that the net inputs of the input patterns are now grouped into two separate sets as
shown in Fig. 1(a); then we say that the node is in a classifiable state. Further, the network is in a classifiable
state of learning if all nodes of the output layer are in a classifiable state. As soon as we reach a classifiable state
of learning, a certain large (fixed) momentum rate (e.g. 0.9) is given to the gain of each node and each
weight so that the learning can be finished more quickly[11,12] to the desired error level. At the same
time we multiply every weight by a certain constant (e.g. 1.4) so that we do not lose the route to a
minimum just found[9]. The value 1.4 above is justified as follows. If we take 0.9 for the momentum rate, then
the largest overshoot possible from the current position is 0.9 * 0.25 = 0.225, where 0.25 is the largest derivative
value of the activation function f(x) = 1 / (1 + exp(-x)). Therefore any small value close to but greater than 1.225
can be used as the multiplier for the weights. Here we picked 1.4.
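
The arithmetic behind the multiplier can be verified with a few lines. The sketch below only reproduces the numbers quoted above (momentum rate 0.9, maximum sigmoid derivative 0.25, chosen multiplier 1.4); the variable names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_derivative = sigmoid_derivative(0.0)            # maximum, attained at x = 0
momentum_rate = 0.9
largest_overshoot = momentum_rate * max_derivative  # 0.9 * 0.25
lower_bound_multiplier = 1.0 + largest_overshoot    # multiplier must exceed this
chosen_multiplier = 1.4
```

The chosen value 1.4 sits above the bound with some margin.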

Here we assume that the initial weights are randomly generated from the real range [-0.5, +0.5], and that each
weight vector is normalized in length. We also assume that the network architecture is always 2-layer
feed-forward. Let $a_i^S$ be the activation of the i-th node of layer S, and let $a^S = [a_1^S\ a_2^S \cdots]^T$ be the column vector
of activation values in layer S. The input layer is layer 0. Let $w_{ij}^S$ be the weight on the connection from the
j-th node in layer S-1 to the i-th node in layer S, and let $w_i^S = [w_{i1}^S\ w_{i2}^S \cdots]^T$ be the column vector of weights
from layer S-1 to the i-th node in layer S. The given weight set is partitioned into K vectors if there are K
processing nodes in the network. The net input to the i-th node of layer S is defined as
$net_i^S = (w_i^S)^T a^{S-1}$, and let $net^S = [net_1^S\ net_2^S \cdots]^T$ be the column vector of net input values
in layer S. Then the activation of a node is given by a function of its net input, $a_i^S = f(g_i^S\, net_i^S)$, where f is
the conventional sigmoid function[1], and $g_i^S$ is a real number called the gain of the node.

If the initial weight vectors are sufficiently small, then the distribution of the net inputs to an output node will
be in the unsaturated region[10]. An unsaturated region is an area where the derivative of the sigmoid activation
function f(x) = 1 / (1 + exp(-x)) is not very small (e.g. not less than a tenth of the maximum value of the derivative
function), so that the error-correcting signal propagated back is significant and hence the learning proceeds
relatively fast.
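
For the example threshold quoted above (a tenth of the maximum derivative), the extent of the unsaturated region can be computed in closed form. This sketch assumes unit gain, so that the derivative is f'(x) = f(x)(1 - f(x)); the function name is hypothetical.

```python
import math

def unsaturated_bound(fraction=0.1):
    """Largest |x| with f'(x) >= fraction * f'(0) for f(x) = 1/(1+exp(-x))."""
    threshold = fraction * 0.25          # f'(0) = 0.25 is the maximum
    # Solve s * (1 - s) = threshold for the larger root s = f(x)
    s = (1.0 + math.sqrt(1.0 - 4.0 * threshold)) / 2.0
    # Invert the sigmoid: x = ln(s / (1 - s))
    return math.log(s / (1.0 - s))

bound = unsaturated_bound()
```

Under these assumptions the unsaturated region is a symmetric interval of net inputs of roughly plus or minus 3.6 around the origin.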

Therefore we want to keep all net inputs to the output node in the unsaturated region until the network comes
to a classifiable state.

By revising the bias of an output node according to the range of its activation, the corresponding net input
distribution is stabilized and the center of the corresponding net input distribution comes near the origin of the net
input axis of the activation function, so that we can prevent premature saturation in the early learning phase.
At the same time we can also diminish the sensitivity to the initial weights.


(a) Classifiable State (b) Bias Adjustment

Fig. 1. The net input distribution, the unsaturated region, and the re-configuring of the bias in an output node

1.2 How to restart

We set the bias of each output node to the value 0.5 at the initial point of learning and, during each following
restart process, the bias is set to the median value of the activation distribution of the output node.

We set the gain of each processing node to 1.0 at the restart process as well as at the initial setting of the
learning. Now we describe specifically what to do for each restart. Let $pat_i^q$ be the i-th component of the
q-th input pattern $pat^q$, m be the number of components in an input pattern, and p be the number of input
patterns to be used for learning. Systematically,

$pat^q = [pat_1^q \cdots pat_m^q]^T, \quad pat_i^q \in \{0,1\}, \quad \forall i \in \{1,2,3,\ldots,m\}, \; q \in \{1,2,3,\ldots,p\}.$

Let the output of the k-th processing node of the output layer according to the q-th input pattern be $O_k^q$,
and the set of the outputs for all input patterns be $\{O_k^q\}$. A restart process consists of 2 steps: bias
adjustment and gain reset.

Step 1: Set the bias of the k-th output node: $bias_k = (\min\{O_k^q\} + \max\{O_k^q\})/2$.

Step 2: Reset the gain of each processing node.

$g_i^S = 1.0, \quad \forall i \in \{1,2,\ldots,l\}$ (cf. l is the number of nodes in the S-th layer.)
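
The two restart steps can be sketched as follows. This is a minimal illustration that takes Step 1 literally
as the midpoint between the smallest and largest output of each output node over all patterns. The function
name `restart` and the nested-list layout of `outputs` and `gains` are assumptions made here, not the
authors' implementation.

```python
from typing import List, Tuple

def restart(outputs: List[List[float]],
            gains: List[List[float]]) -> Tuple[List[float], List[List[float]]]:
    """One restart: bias adjustment (Step 1) followed by gain reset (Step 2).

    outputs[q][k] holds O_k^q, the output of output node k for pattern q.
    gains[S][i] holds g_i^S, the gain of node i in layer S.
    """
    n_out = len(outputs[0])
    biases = []
    for k in range(n_out):
        column = [pattern_out[k] for pattern_out in outputs]
        # Step 1: midpoint of the observed activation range of node k.
        biases.append((min(column) + max(column)) / 2.0)
    # Step 2: every gain in every layer goes back to 1.0.
    new_gains = [[1.0 for _ in layer] for layer in gains]
    return biases, new_gains
```

For example, an output node whose activations over the pattern set range from 0.2 to 0.6 would receive
the bias 0.4 under this reading.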


3.  Experiments 

We are interested in the comparison of the learning time between the conventional BP [1] and the BPG with the
restart process proposed in this paper. For this purpose we used the XOR, 5-bit parity, and 4-bit
encoder-decoder problems. We tried each problem 15 times with initial weights randomly generated within the
real range [-0.5, +0.5] each time. The basic network architecture used is a feed-forward, fully connected
2-layer network (hidden layer and output layer for both cases). The notation x-y-z for the network structure
is used here to indicate that the network has x nodes in the input layer, y nodes in the hidden layer, and
z nodes in the output layer.

For the XOR and 5-bit parity problems, we used 2-2-1 and 5-5-1 network structures respectively. For the 4-bit
encoder problem, we used a 4^1 network structure. For the learning by the conventional BP algorithm, we set the
learning rate to 0.7 and the momentum rate to 0.9. For the learning by the BPG with restart, we set the learning
rate for weight change to 1.0 and the learning rate for gain change to 1.0, and the momentum rates for both
weight and gain were set to 0.

In the case of the BPG with restart, we multiplied each connection weight by 1.4 and set the momentum rates for
both weight and gain change to 0.9 in order to accelerate the further learning, as described above, as soon as
we reached a classifiable state. We witnessed through the three experiments that the learning always converged
to a lower error level without deviating from the point where a classifiable state was reached. The
experimental results are summarized in 3 graphs and 3 tables. A failure means that the error level could not be
reached within 3000 learning epochs starting with the given initial weights. From the results it is evident
that the BPG with restart performs consistently well and that its learning speed is about 10 times faster than
that of BP.

(Axis label: Mean Square Error)

Table 1. XOR Problem, Avg. Epochs (No. of Failures) / Std. Dev.

Table 2. 5-bit Parity Problem, Avg. Epochs (No. of Failures) / Std. Dev.

Fig. 4. The error curve for the 4-bit Encoder problem

Table 3. 4-bit Encoder Problem, Avg. Epochs (No. of Failures) / Std. Dev.


4. Conclusion

We introduced a restart concept for the BPG learning algorithm. We described when to restart the learning as
well as when to stop the restart operations.

We provided some of the reasons why such restart operations help speed up the learning. We introduced the
concept of a classifiable state in order to identify the learning stage where no more restart operations are
necessary. We showed that a further speedup can be attained by applying a different strategy from the point of
the classifiable state. Computer simulations indicated a significant improvement in the learning time with our
proposed method for the XOR, 5-bit parity and 4-bit encoder problems. These preliminary results showed that
the new algorithm is about 10 times faster than the conventional BP algorithm.


Abstract

A new classifier neural network architecture and corresponding learning
algorithm are introduced. The hypothesis space for this algorithm is the set of
all piecewise linear separations. Properties that are desirable in neural
network architectures in general are discussed and related to the present
design. The algorithm is shown to be consistent. A means for reducing the size
of the model produced by the learning algorithm, and a rationale for
performing such a reduction, are given.


1.  Introduction 

In this paper, a classifying neural network and learning algorithm which
generate piecewise linear models are presented. The learning problem to which this
algorithm is applied is that of learning concepts about which no a priori knowledge
is available, except perhaps that the concept is representable as an open subset of the
input space. As is the case with most learning classifiers, the learning algorithm
accepts a set of samples from the input space along with the correct classification of
those samples and produces or configures a network which effectively partitions the
input space into regions which represent two or more classes. Configuration of the
net includes actions such as determining the number of nodes or neurons,
determining the connectivity, and determining the synaptic weights.

The  introduction  of  any  new  learning  algorithm  or  network  should  be 
accompanied  by  a  description  of  its  desirable  properties.  The  standard  used  to 
evaluate  the  network  introduced  in  this  paper  is  summarized  below.  However,  in 
order  to  describe  these  properties,  some  preliminary  definitions  must  be  given. 

Definition: Concept - A set of points in the network's input space. Given
samples and counter-examples of this set, a network's learning algorithm attempts to
set parameters or synaptic weights in the network such that novel points, which are
also in the concept, are recognized.

Definition: Sample - A point from the input space, paired with information
regarding membership in some concept c.

Definition: Hypothesis - The set of all points in the input space which are
classified as belonging to the concept, after the network has been trained.

Definition: Hypothesis Space - The set of all hypotheses which can be
generated by a given learning algorithm when presented with concepts from a
given concept class.

Definition: Consistent Learning Algorithm - A learning algorithm is
consistent if the network it configures is guaranteed to correctly classify all
members of the example training set.

Desirable  properties  of  classifying  neural  networks  may  now  be  listed  as  follows: 

Property 1) The hypothesis space of the learning algorithm should be large
enough to model a broad range of potential concepts. While the concept space of a
given problem domain need not match the hypothesis space generated by the
learning algorithm used, a large concept space does require a correspondingly large
hypothesis space. An example of a large hypothesis space is the set of all open sets.
In fact, it is doubtful that any neural net-like system can produce a useful hypothesis
if the concept space is larger than the set of all open sets.

An  example  of  a  small  hypothesis  space  is  the  classic  example  of  linear 
separations  or  half  spaces.  Few  problems  with  interesting  applications  can  be 
reduced  to  such  simplicity.  These  limited  hypothesis  spaces  can  only  be  used  to 
construct  models  for  small  concept  recognition  problems.  A  small  concept  class  is 
indicative  of  having  some  specific  knowledge  of  the  nature  of  the  concept  to  be 
learned.  In  these  cases,  it  may  be  argued  that  a  generally  applicable  and  powerful 
learning  system  is  not  required. 

The  size  of  the  hypothesis  space  is  determined  by  the  nature  of  the  network 
architectures  produced  by  the  learning  algorithm  rather  than  by  the  specifics  of  the 
learning  algorithm  itself.  For  example,  it  is  known  that  multilayer  nets  with  linear 
units  having  sigmoidal  activation  functions  have  a  very  large  hypothesis  space 
(Funahashi,  1989).  However,  this  does  not  mean  that  there  exists  an  algorithm  which 
can  configure  such  a  net  so  that  it  will  reliably  learn  complex  concepts.  Therefore  at 
least  one  additional  property  is  required. 

Property 2) The network, after training, should have a high probability of
correctly classifying samples from its input space. This requirement is obvious,
although formulating it precisely, and knowing when the requirement has been met,
is far from obvious.

One precise formulation of this property is the notion of polynomial learnability,
also known as probably approximately correct (PAC) learnability.

Definition: Polynomially Learnable - Let s be an upper bound on size(c),
where size() is some concept complexity measure. A concept class C is polynomially
learnable if there exists a polynomial time algorithm A that accepts samples from
$c \in C$ according to an arbitrary probability distribution over the sample space, and
returns a hypothesis h (i.e. the network is trained), such that for all
$0 < \epsilon, \delta < 1$ and $s \ge 1$, there exists a sample size $m(\epsilon,\delta,s)$,
polynomial in $1/\epsilon$, $1/\delta$ and s, such that given a random sample set of size
$m(\epsilon,\delta,s)$, A produces, with probability at least $1-\delta$, a
hypothesis h that has error at most $\epsilon$. (Valiant, 1984)

In essence, the polynomial learnability requirement demands that reliable
learning can be achieved with a practical number of examples. Furthermore, it
sensibly allows the number of examples to grow as the complexity of the concept or
the stringency of the reliability is increased.
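
The requirement can be written compactly as follows. This is a restatement under notation introduced
here for convenience and not taken from the paper: D denotes the sampling distribution, S a random sample
of size m drawn from D, $h_S$ the hypothesis returned by A on S, and $\mathrm{err}_D$ the probability of
misclassification under D.

```latex
\Pr_{S \sim D^{m}}\bigl[\, \mathrm{err}_D(h_S) \le \epsilon \,\bigr] \;\ge\; 1 - \delta,
\qquad
\mathrm{err}_D(h) = \Pr_{x \sim D}\bigl[\, h(x) \ne c(x) \,\bigr],
\qquad
m = m(\epsilon, \delta, s) \ \text{polynomial in}\ 1/\epsilon,\ 1/\delta,\ s .
```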

The definition of polynomially learnable concept classes is, of course, only a
definition and therefore it does not provide a means for determining whether a
particular algorithm can polynomially learn concepts from a particular concept
class. A theorem which may be helpful in this regard is provided by Blumer et al.
(1989). Blumer's theorem states, in part, that a concept class C is polynomially
learnable if there is an Occam algorithm for C.

Definition: Occam Algorithm - A learning algorithm A is Occam if:

i) A runs in polynomial time p(m).

ii) A is consistent.

iii) Let $C^A_{s,m}$ be the set of all hypotheses produced by A when A is given m
samples with respect to a concept c in C, where size(c) $\le$ s. The VC
dimension (Haussler and Welzl, 1987) of $C^A_{s,m}$ must be less than $p(s)m^{\alpha}$.
p(s) is some polynomial in s and $\alpha$ is a constant in [0,1).

Property 3) The algorithm should run in an amount of time which makes use
of the algorithm practical. Convention from the theory of computing places the
boundary between practical and impractical as being between the polynomial time
algorithms and the exponential time algorithms. Time functions are measured in the
input size, which in our case is the number of samples provided for training.
Polynomial time properties are included in the definitions of polynomially learnable
and Occam algorithm above.

Combining desirable property (1) with Occam definition items (ii) and (iii), one
can see that a learning algorithmist seeks an algorithm with $C^A_{s,m}$ large enough so
that consistency with respect to a large concept class is possible, yet $C^A_{s,m}$ is not so
large that its VC dimension exceeds $p(s)m^{\alpha}$. In this paper it will be shown that, no
matter how large C is, the algorithm presented can always produce a consistent
hypothesis. Furthermore, in the spirit of reducing the VC dimension of $C^A_{s,m}$, the
learning algorithm includes a means of reducing the number of linear separations
which make up the entire piecewise linear separation. The number of linear
separations is related to the VC dimension. It is not difficult to show that the VC
dimension of the set of all piecewise separations having k linear separations is
$\ge k(n+1)$.


2.  Piecewise  Linear  Neural  Networks 

A piecewise linear separation is a suitable approximator of certain concept
classes. If the concept is an open set which is separated from non-concept points by
piecewise continuous manifolds, or hypersurfaces, then those manifolds may be
approximated by multiple bounded hyperplanes. The resulting set of bounded
hyperplanes forms a separation which is piecewise linear.

Algorithms exist for determining a single linear separation when the sample
points are so separable (Rosenblatt, 1962; Karmarkar, 1984). These algorithms are
well known. In the case of Karmarkar, the algorithm is polynomial time. The
primary difficulties in extending single linear separations to piecewise linear
separations lie in determining which linearly separable subsets of the sample set
should be grouped together and in determining which regions of the input space are
associated with each linear separation.

The simplest possible case one may consider is that of two samples, $x_a$ and $x_b$, of
opposite class, one belonging to the concept and one not belonging. In order to
approximate the surface which bounds the concept and separates the points one may
employ the following observation: a line segment connecting any two points of
opposite type must include a point which is on the surface. A point on the segment
must be selected as well as a surface model type and orientation. Since the simplest
sufficient model of any phenomenon tends to be superior, the simplest possible
assumptions are made, namely: the model surface type is a hyperplane, the
orientation of the plane is given by the normal vector $x_a - x_b$, and the point shared by
the hyperplane and the line segment is chosen to be $(x_a + x_b)/2$. Nilsson (1965)
generalized this idea from point pairs to point cluster pairs. The present algorithm
generalizes the idea of generating hyperplanes from point pairs to the case of an
arbitrary number of sample points which are not necessarily clustered and not
necessarily linearly separable.
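
The two-sample construction can be illustrated with a short sketch. It assumes only what the paragraph
states: a normal vector equal to the difference of the two samples and a plane passing through their
midpoint. The names `hyperplane_from_pair` and `side` are hypothetical helpers introduced for this example.

```python
from typing import List, Tuple

Vector = List[float]

def hyperplane_from_pair(x_a: Vector, x_b: Vector) -> Tuple[Vector, Vector]:
    """Return (normal, point) of the hyperplane midway between x_a and x_b."""
    normal = [a - b for a, b in zip(x_a, x_b)]          # x_a - x_b
    point = [(a + b) / 2.0 for a, b in zip(x_a, x_b)]   # (x_a + x_b) / 2
    return normal, point

def side(x: Vector, normal: Vector, point: Vector) -> float:
    """Signed value (x - point) . normal; positive on the x_a side of the plane."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, point, normal))

# Two samples of opposite class on the horizontal axis.
normal, point = hyperplane_from_pair([2.0, 0.0], [0.0, 0.0])
```

With these two samples the plane passes through (1, 0); the first sample lies on the positive side and
the second on the negative side.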


3.  Target  Network  Architecture 

The network architecture which is trained by the learning algorithm is shown in
figure 1. The learning algorithm determines the number of nodes as well as the
synaptic weights. There are five node types in the net: m, z, g, n, and M. There is a
plurality of (m, z, g, n) 4-tuples while M is unique. The m, z, and n nodes each store
a vector having the same dimension, n, as the input space. Input vector I enters the
system and is distributed to all m and z nodes. z computes $\Delta = I - z$ while m computes
the magnitude $|\Delta|$. Each m directs its output $|\Delta|$ to node M, where the minimum of all
$|\Delta|$s is computed. The minimum is then broadcast back to the m nodes, which compare
the minimum with $|\Delta|$. If m's output matches the minimum, m outputs an open-gate
signal to gate node g. Gate node g then passes z's output to node n, which is a
perceptron. Only one perceptron receives input in response to net input I. The
perceptron performs the usual dot product between its input and the stored synaptic
weight vector. The dot product is compared to a threshold which is always zero and
the network outputs zero or one depending on the comparison. The input to the
network is thus classified as belonging to a concept or not.


4. Learning Algorithm

The learning algorithm proceeds as follows:

1) Acquire N sample points and their classifications:

   $S \equiv \{(x_i, C_i) \mid i = 1 \ldots N,\; C_i = A \text{ or } B\}$, the set of classified samples
   $A \equiv \{x_i \mid C_i = A\}$, the set of all samples having type A
   $B \equiv \{x_i \mid C_i = B\}$, the set of all samples having type B

2) For each $(x_a, C_a) \in S$, find $(x_b, C_b)$ such that $C_a \ne C_b$ and
   $|x_a - x_b| \le |x_a - x_k|$ for all k where $C_a \ne C_k$. In other words, find the
   nearest neighbor of opposite type. Call the resultant set of pairs of
   classified samples P.

3) Order the elements in each pair $[(x_a, C_a), (x_b, C_b)]$ such that the first
   element is of class A and the second is of class B, if not already so
   ordered. Remove redundant elements from P.

4) For each $[(x_a, C_a), (x_b, C_b)] \in P$, compute the normal

   $n = x_a - x_b$

   and the midpoint

   $m = (x_a + x_b)/2$.

   Define a set of 5-tuples called candidate posts

   $C \equiv \{(x_a, x_b, n, z, m)\}$

   comprised of the sample points from each pair in P with their
   corresponding normals and midpoints.

5) Adjust the midpoint z of each candidate as follows:

   At least one of the two following balls exists:

   $G_a \equiv$ uniform ball about $x_a$, where $x_a$'s NNOT is $x_b$.
   $G_b \equiv$ uniform ball about $x_b$, where $x_b$'s NNOT is $x_a$.

   a) If they both exist then do nothing.
   b) If the uniform ball $G_a$ exists, set $s \equiv \min\{\ldots\}$.
   c) z is then changed to $x_{\ldots} + \ldots\, n$.
   d) If the uniform ball $G_b$ exists, set $\ldots$
   e) z is then changed to $\ldots$

6) The candidate posts compete to be included in the final post set:

   Definition - A candidate classifies a point x as being in A or in B as follows:
   if $(x - z) \cdot n > 0$ then x is classified as being in A. Otherwise x is classified
   as being in B. Note that x being classified as in A (or B) does not necessarily
   imply that x must actually be in A (or B).

   Definition - The popularity of a candidate is equal to the number of samples in
   the largest ball centered at m such that no sample in the ball is incorrectly
   classified.

   Definition - For a given x, let G be the largest ball, centered at x, which
   contains no candidates which misclassify x. A proxy candidate for x is the
   most popular candidate in G.

   a) Compute the popularity of each candidate.
   b) For each sample point x, find a proxy candidate.
   c) Remove any candidate from the candidate list if it is not a proxy for
      some sample.
   d) If some removals occur, go to (b).
   e) The remaining candidate posts are the posts and are used to define the
      weights of the classification network. The vectors m are stored in the
      network's m nodes, the vectors z are stored in the z nodes and the
      vectors n are stored as the weights of the perceptron nodes n.
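
Steps 1 to 4 and the run-time behaviour of the resulting network can be sketched as follows. This is a
simplified illustration under several assumptions: the normal is taken as the unnormalized difference of
the paired samples, z is left at the midpoint (the adjustment of step 5 and the competition of step 6 are
omitted, so consistency is not guaranteed in general), and the gating selects the post whose stored m
vector is nearest to the input. The names `candidate_posts` and `classify` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Post = Tuple[Vector, Vector, Vector]  # (m, z, n)

def distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def candidate_posts(samples: List[Vector], labels: List[str]) -> List[Post]:
    """Steps 1-4: pair every sample with its NNOT and build (m, z, n).

    Assumes both classes "A" and "B" occur at least once in labels.
    """
    pairs = set()
    for a, (x_a, c_a) in enumerate(zip(samples, labels)):
        opposite = [k for k, c_k in enumerate(labels) if c_k != c_a]
        b = min(opposite, key=lambda k: distance(x_a, samples[k]))
        # Step 3: store as (class-A index, class-B index); the set drops duplicates.
        pairs.add((a, b) if c_a == "A" else (b, a))
    posts = []
    for a, b in sorted(pairs):
        x_a, x_b = samples[a], samples[b]
        n = [p - q for p, q in zip(x_a, x_b)]          # normal, pointing toward class A
        m = [(p + q) / 2.0 for p, q in zip(x_a, x_b)]  # midpoint
        posts.append((m, list(m), n))                  # z starts at the midpoint
    return posts

def classify(x: Vector, posts: List[Post]) -> str:
    """Gate on the nearest m, then apply the zero-threshold test (x - z) . n > 0."""
    m, z, n = min(posts, key=lambda post: distance(x, post[0]))
    score = sum((xi - zi) * ni for xi, zi, ni in zip(x, z, n))
    return "A" if score > 0 else "B"
```

For instance, with samples (0,0) and (1,0) of class A and (4,0) and (5,0) of class B, three distinct
candidate posts result, and each of the four training samples is classified correctly by its nearest post.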

5.  Consistency 

Definition:  NNOT  -  Nearest  neighbor  of  opposite  type. 

Definition: Uniform Ball of a Point - The open ball centered at a sample and
having radius equal to the distance from that sample to the sample's NNOT. Uniform
balls contain only samples of the type found at center, if any.

Definition: Candidate Post of a Sample Point - In the algorithm, each sample
point x generates a post P by first defining a NNOT for x. x and the NNOT then
define a midpoint and so on. In this manner, each sample has assigned to it a unique
candidate post. P is referred to as the candidate post of sample x. If the candidate
survives the competition to become a post, then P is referred to as the post of a sample
point x.

Definition: Uniform Ball of a Post - Let $x_a$ and $x_b$ be the samples which
generate post P. From the algorithm for the construction of a post, we know that
there is a ball of radius $|x_a - x_b|$ and centered at either $x_a$ or $x_b$ which uniformly
contains points of the center type. This ball, of which there is at least one and
possibly two, is the uniform ball of the post.

Theorem: The learning algorithm is consistent.

Proof:

By lemma 1, the network which may be generated using candidate posts defined after
z adjustment and before candidate competition is consistent. It remains to show that
the process of candidate competition preserves consistency. To see that this is the
case, note that in the algorithm, for each sample the corresponding proxy is chosen
such that no other candidate which is closer to the sample will misclassify it.
Therefore, if some other candidate is chosen as a proxy for some other sample point,
and if that proxy is closer to the original sample than its own proxy, the new proxy
will also correctly classify the original sample. Hence, the closest post, as determined
by m, correctly classifies each sample. This is all that is required for consistency of
the network.
QED.

Lemma 1: In the algorithm, following the adjustment of the z vectors, a network
formed using the candidate posts as weights is consistent.

Proof:

A sample will always be correctly classified by its own post. Therefore, for
misclassification to occur, a sample must be closer to the m point of another post and
that other post must misclassify the sample. Such potential situations may be broken
down into two cases.

Case 1) The uniform balls of $P_j$ and $x_i$ are of the same type. Let $P_j$ be the post which
potentially misclassifies sample $x_i$. Both $P_j$ and $x_i$ have associated with them a
uniform ball. If these uniform balls are of the same type, then by lemma 2, $x_i$ is in
the uniform ball of $P_j$. Since the z vector of $P_j$ has been adjusted to ensure that all
points in $P_j$'s uniform ball are properly classified, $x_i$ is properly classified.

Case 2) The uniform balls of $P_j$ and $x_i$ are of different types. By lemma 3, $x_i$
cannot be closer to $m_j$ than it is to $m_i$. Therefore there can be no misclassification.

Since in both cases there is no misclassification, the network is consistent.
QED.

Lemma 2: Let $(x_i, x_{i'}, m_i, n_i, z_i)$ and $(x_j, x_{j'}, m_j, n_j, z_j)$ be candidate posts
produced by the algorithm. If $x_j$ has a uniform ball and if $x_i$ is closer to $m_j$ than it is
to $m_i$, and if $x_i$ has a uniform ball, then $x_i$ is in the uniform ball of $x_j$.

Proof by contradiction: Let $x_{jk}$ denote the k-th component of $x_j$. Likewise for $x_i$ and $x_{j'}$.
Let $r \equiv |x_i - x_{i'}|$.

*) Assume that $x_i$ is closer to $m_j$ than it is to $m_i$ and $x_i$ is not in the uniform
ball of $x_j$.

By the definition of a midpoint,

1) $m_i = (x_i + x_{i'})/2, \quad m_j = (x_j + x_{j'})/2$

Without loss of generality one may place $x_i$ at the origin. "$x_i$ is closer to $m_j$ than it
is to $m_i$" may be expressed as:

2) $|m_j| < |m_i|$ or $|m_j|^2 < r^2/4$

By the definition of "uniform ball":

3) $\sum_k x_{j'k}^2 \ge r^2$

"$x_i$ is not in the uniform ball of $x_j$" can be written as:

4) $\sum_k x_{jk}^2 \ge \sum_k (x_{jk} - x_{j'k})^2$

Combining (1) & (2) gives:

5) $\sum_k \left( \frac{x_{jk} + x_{j'k}}{2} \right)^2 < r^2/4$ or $\sum_k (x_{jk} + x_{j'k})^2 < r^2$

Expanding (4) & (5):

6) $2 \sum_k x_{jk} x_{j'k} \ge \sum_k x_{j'k}^2$

7) $\sum_k [x_{jk}^2 + 2 x_{jk} x_{j'k} + x_{j'k}^2] < r^2$

Add (3) to (6):

8) $2 \sum_k x_{jk} x_{j'k} + \sum_k x_{j'k}^2 \ge \sum_k x_{j'k}^2 + r^2$

Which implies:

9) $2 \sum_k x_{jk} x_{j'k} \ge r^2$

Using (3), (9), and the fact that $\sum_k x_{jk}^2$ is positive, one has:

10) $\sum_k [x_{jk}^2 + 2 x_{jk} x_{j'k} + x_{j'k}^2] > 2 r^2$

which contradicts (7). Therefore, statement (*) must be false.

QED.

Lemma 3: Let $(x_i, x_{i'}, m_i, n_i, z_i)$ and $(x_j, x_{j'}, m_j, n_j, z_j)$ be candidate posts
produced by the algorithm. If $x_i$ has a uniform ball and if $x_{j'}$ has a uniform ball,
then $x_i$ cannot be closer to $m_j$ than it is to $m_i$.

Proof by Contradiction:

The radius of $x_i$'s uniform ball must be at most $|x_i - x_{j'}|$; likewise for $x_{j'}$'s
uniform ball. By the definition of midpoint, $m_i$ is on the sphere centered at $x_i$ and
of radius $|x_i - x_{i'}|/2$. Call the interior of this sphere B. If $x_i$ is closer to $m_j$ than it
is to $m_i$ then $m_j$ must be in the ball B. However, $m_j$ in B implies
$|x_{j'} - x_j| > |x_i - x_{j'}|$. This would, in turn, imply that $x_i$ is in the uniform ball of
$x_{j'}$. Since this is impossible, $x_i$ must not be closer to $m_j$ than it is to $m_i$.

QED.
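
The step in the proof of Lemma 3 stating that $m_j$ lying in B forces $x_i$ into the uniform ball of
$x_{j'}$ can be spelled out with the triangle inequality. The following is a sketch under these readings,
which are assumptions made here: $r_i = |x_i - x_{i'}|$ and $r_j = |x_j - x_{j'}|$ are the two radii
(symbols introduced for convenience), $r_i \le |x_i - x_{j'}|$ because $x_{j'}$ is of the type opposite to
$x_i$, $|x_i - m_j| < r_i/2$ because $m_j$ lies in B, and $|m_j - x_{j'}| = r_j/2$ by the definition of the
midpoint.

```latex
\begin{aligned}
|x_i - x_{j'}| &\le |x_i - m_j| + |m_j - x_{j'}| \\
               &<   \tfrac{1}{2}\, r_i + \tfrac{1}{2}\, r_j \\
               &\le \tfrac{1}{2}\, |x_i - x_{j'}| + \tfrac{1}{2}\, r_j ,
\end{aligned}
\qquad\text{hence}\qquad
|x_i - x_{j'}| < r_j = |x_j - x_{j'}| .
```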


6.  Conclusion 

A learning algorithm for a piecewise linear network has been presented. It has
been proven that this algorithm can consistently learn an arbitrary training set
from a large concept class. While this paper stops short of proving the algorithm is
Occam, it does show one means of reducing the model size. In forthcoming work, the
algorithm will be shown to be polynomial. Future efforts will also include an attempt
to show that this algorithm, or a variant of it, is an Occam algorithm.


ABSTRACT - A fast and robust algorithm is presented for training multilayer feedforward neural
networks as an alternative to the backpropagation algorithm. The number of iterations required by the new
algorithm to converge is less than 10% of what is required by the backpropagation algorithm. Also, it is
less affected by the choice of initial weights and setup parameters.

The  algorithm  uses  a  modified  form  of  the  backpropagation  algorithm  to  minimize  the  mean-squared 
error  between  the  desired  and  actual  outputs  with  respect  to  the  inputs  to  the  nonlinearities.  This  is  in 
contrast  to  the  standard  algorithm  which  minimizes  the  mean-squared  error  with  respect  to  the  weights. 

The  new  algorithm  will  be  called  "Predictor  of  Linear  Output"  (PLO),  in  terms  of  its  function. 

Estimated  linear  signals,  generated  by  the  modified  backpropagation  algorithm,  are  used  to  produce  an 
updated  set  of  weights  through  a  system  of  linear  equations  (which  has  an  easy  resolution)  at  each  node. 


I.  INTRODUCTION 


The  feedforward  neural  networks  are  used  in  a  number  of  applications,  e.g.,  control,  see  [6]  and  [7]. 
Because  of  the  hidden  layers,  they  have  overcome  many  limitations  of  single  layer  perceptrons.  These 
types  of  networks  are  trained  ahead  of  time,  using  known  input/output  data.  Once  trained,  the  network 
weights  are  frozen  and  unknown  data  can  be  run  through  the  network. 

The classical method for training a multilayer perceptron is the backpropagation algorithm [1]-[3], which
is an iterative gradient algorithm designed to minimize the mean-squared error between the desired output
and the actual output for a particular input to the net.

Although it is successfully used in many cases, the backpropagation algorithm suffers from a number
of shortcomings. One such shortcoming is the rate at which the algorithm converges. Many iterations are
required to train small networks for even the simplest problems. For large network structures and data
sets, it may take days or weeks to train the network. A training algorithm that reduces this time
would be of considerable value.

It is the purpose of this paper to present a new alternative algorithm which is considerably faster than the
backpropagation algorithm [1]-[3] and has the added advantage of being less affected by poor initial
weights and setup parameters (another shortcoming of the backpropagation algorithm). Besides, the new
algorithm is robust and much simpler to build and to understand than other modifications of
backpropagation [4], with less computational complexity, faster convergence and better quality of
output.

Estimated  linear  signals,  generated  by  the  modified  backpropagation  algorithm,  are  used  to  produce  an 
updated  set  of  weights  through  a  system  of  linear  equations  (which  has  an  easy  resolution)  at  each  node. 

Training  patterns  are  run  through  the  network  until  convergence  is  reached. 


II.  PREDICTOR  OF  LINEAR  OUTPUT 
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Before  beginning  with  the  functional  description  of  the  algorithm,  let  us  state  some  pertinent  definitions 
on  the  basis  of  Fig.  1 ; 

SAE = sigmoid$^1$ adaline element

$\mu$ = learning rate. Its selection controls stability and speed of convergence
$k$ = time index or cycle number
Also, $(\cdot)^T$ denotes the transpose of a vector or a matrix.

For a single SAE (Fig. 1(a)):
$d_k$ = desired response at time $k$
$W_k = [w_{0,k}, w_{1,k}, \ldots, w_{n,k}]^T$ is the weight vector at time $k$  (1)

$X_k = [+1, x_{1,k}, \ldots, x_{n,k}]^T$ is the present input pattern vector (IPV) at time $k$  (2)

$s_k$ = linear output at time $k$, being $s_k = W_k^T X_k = X_k^T W_k$
$y_k$ = sigmoid output at time $k$, being $y_k = \mathrm{sgm}(s_k)$

$\epsilon_k = d_k - y_k$ is the sigmoid error at time $k$  (3)

For an isolated SAE of a multilayer net (Fig. 1(b)):

$X_k^{(i)} = [+1, x_{1,k}^{(i)}, x_{2,k}^{(i)}, \ldots, x_{n,k}^{(i)}]^T$ is the pattern vector of the $i$-th layer at time $k$; e.g., for the output of the $j$-th SAE, it will be

$x_{j,k}^{(i)} = \mathrm{sgm}(s_{j,k}^{(i)})$  (4)

$W_{j,k}^{(i)} = [w_{j0,k}^{(i)}, w_{j1,k}^{(i)}, w_{j2,k}^{(i)}, \ldots, w_{jn,k}^{(i)}]^T$ is the weight vector of the $j$-th SAE of the $i$-th layer at time $k$.

$\epsilon_k^{(i)} = [\epsilon_{1,k}^{(i)}, \epsilon_{2,k}^{(i)}, \ldots, \epsilon_{n,k}^{(i)}]^T$ is the back-propagated error vector to the $i$-th layer at time $k$.

$s_k^{(i)} = [s_{1,k}^{(i)}, s_{2,k}^{(i)}, \ldots, s_{n,k}^{(i)}]^T$ is the linear output vector of the $i$-th layer at time $k$; e.g., for the $j$-th SAE, it will be

$s_{j,k}^{(i)} = W_{j,k}^{(i)T} X_k^{(i-1)}$  (5)

Note: In both cases, the first step is to randomize all present weights.

I. PLO applied to a single SAE: Fig. 1(a) shows in detail the implementation of the PLO algorithm for
a single neuron. The algorithm is

$s_{k+1} = s_k + \mu \hat{\nabla}_k$  (6)

where

$\hat{\nabla}_k = -\partial E_k / \partial s_k$  (7)

is the instantaneous error gradient for this element with respect to the linear output $s_k$, and

$E_k = \tfrac{1}{2}(\epsilon_k)^2$  (8)

is one half of the squared sigmoid error. Therefore, substituting Eq. (8) into Eq. (7), we have

$\hat{\nabla}_k = -\tfrac{1}{2}\,\partial(\epsilon_k)^2 / \partial s_k = \epsilon_k\,\mathrm{sgm}'(s_k)$  (9)

This particular gradient coincides with the square error derivative associated with a single SAE (see p.
1434 in [1]),

$\delta_k = -\tfrac{1}{2}\,\partial(\epsilon_k)^2 / \partial s_k = \epsilon_k\,\mathrm{sgm}'(s_k)$  (10)

Since Eq. (9) and Eq. (10) coincide, substituting Eq. (10) into Eq. (6) gives

$s_{k+1} = s_k + \mu\,\delta_k = s_k + \mu\,\epsilon_k\,\mathrm{sgm}'(s_k)$  (11)


$^1$The term "sigmoid" is usually used in reference to monotonically increasing "S-shaped" functions, such as the
hyperbolic tangent. The sigmoid function will be represented by the term "sgm", while "sgm'" will be its derivative
with respect to $s_k$ (the linear output).

Besides,

$s_{k+1} = W_{k+1}^T X_{k+1} = X_{k+1}^T W_{k+1}$  (12)

and, $[X_{k+1} X_{k+1}^T]$ being a square and positive semidefinite matrix, algebraic manipulation gives

$W_{k+1} = s_{k+1} X_{k+1} / |X_{k+1}|^2$  (13)
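The single-SAE update of Eqs. (11) and (13) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: sgm is taken to be the hyperbolic tangent, the same input pattern is reused for $X_k$ and $X_{k+1}$, and the function name `plo_step` and the value of `mu` are hypothetical.

```python
import math

def plo_step(w, x, d, mu=0.5):
    """One PLO-style update for a single sigmoid adaline element (sketch)."""
    s = sum(wi * xi for wi, xi in zip(w, x))    # linear output s_k = W_k^T X_k
    y = math.tanh(s)                            # sigmoid output (sgm assumed to be tanh)
    err = d - y                                 # sigmoid error
    s_new = s + mu * err * (1.0 - y * y)        # Eq. (11), using tanh'(s) = 1 - tanh(s)^2
    norm_sq = sum(xi * xi for xi in x)          # |X|^2, at least 1 because of the +1 bias input
    w_new = [s_new * xi / norm_sq for xi in x]  # Eq. (13): weights that reproduce s_new
    return w_new, s_new

# Repeated presentation of one pattern: the sigmoid error shrinks at each step.
weights = [0.1, -0.2, 0.3]
pattern = [1.0, 0.5, -0.5]                      # first component is the bias input +1
for _ in range(20):
    weights, s_est = plo_step(weights, pattern, d=0.4)
print(abs(0.4 - math.tanh(s_est)))
```

By construction the new weight vector satisfies $X^T W_{k+1} = s_{k+1}$, which is the property Eq. (12) requires.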

II. PLO applied to an isolated SAE of a multilayer net: Fig. 1(b) shows in detail the
implementation of the PLO algorithm for an isolated neuron of a multilayer net. This technique is based on
the square error derivative associated with the $j$-th neuron in the $i$-th layer (see pp. 1433-1434 in [1]),

$\delta_{j,k}^{(i)} = \mathrm{sgm}'(s_{j,k}^{(i)})\,\epsilon_{j,k}^{(i)}$  (14)

(15) 

is  the  back-propagated  error  up  to  the  j***  neuron  of  the  i***  layer  ;  and 

NM  N« 

S  S  O"  (16) 

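The instantaneous sum squared error $\epsilon_k^2$ is the sum of the squares of the errors between each of the
outputs of the network and the corresponding desired responses $d^{(n)}$ for the $n$-th layer, i.e., the last layer.
$N^{(i+1)}$ is the number of neurons of the $(i+1)$-th layer. Finally, the estimated linear signal of the
$j$-th neuron of the $i$-th layer will be

$s_{j,k+1}^{(i)} = s_{j,k}^{(i)} + \mu\,\delta_{j,k}^{(i)} = s_{j,k}^{(i)} + \mu\,\epsilon_{j,k}^{(i)}\,\mathrm{sgm}'(s_{j,k}^{(i)})$  (17)

Here too, $[X_{k+1}^{(i-1)} X_{k+1}^{(i-1)T}]$ being a square and positive semidefinite matrix,

$W_{j,k+1}^{(i)} = s_{j,k+1}^{(i)} X_{k+1}^{(i-1)} / |X_{k+1}^{(i-1)}|^2$  (18)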

III.  PATTERN  RECOGNITION 


We consider the training of a neural network to recognize patterns presented to its input. Although many
different experiments were performed with several data sets and different network structures, only a
limited number are presented here because of space considerations.

The input pixels are set to a level of -0.5 or +0.5. It is important to note that these levels are considered
to be analog values rather than digital binary values. Although the experiments presented here use
patterns of two levels, similar results are obtained with patterns of various gray levels.

The output is likewise treated as analog. Therefore, the network is considered trained only when the
output agrees with the desired output not only in sign but also in absolute value.

The 7x7 pixel patterns in Fig. 2 are the inputs to a 2-layer feedforward perceptron with 16 nodes in the
hidden layer. The desired output of the network is a 2-, 3-, or 4-bit binary word depending upon the number of
patterns used to train the network.

Fig.3  shows  the  mean-squared  error  versus  the  iteration  number  for  both  algorithms  during  training  for 
the  7  X  7  example.  Table  I  presents  numerically  the  performance  comparison  of  the  two  algorithms  shown 
in  Fig.3.  This  comparison  takes  into  account  the  computational  efficiency  of  each  algorithm  as  well  as  the 
number  of  iterations  required  for  the  algorithm  to  reach  a  specified  mean-squared  error.  The  result  is  a 
time  ratio  of  the  two  algorithms  when  run  on  a  sequential  machine.  A  mean-squared  error  convergence  of 
slightly less than 0.25 was chosen since this value is the maximum that can be used and still produce correct
results,  assuming  that  the  outputs  are  eventually  passed  through  a  hard  limiter  to  produce  a  binary  word. 
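The 0.25 threshold can be checked numerically. The sketch below assumes desired outputs of +0.5 and -0.5 (the same levels used for the input pixels) and a hard limiter that thresholds at zero; both are assumptions, as are the helper names. If the squared error of an output is below 0.25, its error is below 0.5 in magnitude, so the output keeps the sign of its target. Note that this holds per output; a mean below 0.25 does not by itself guarantee it for every individual output.

```python
def hard_limit(y):
    """Hard limiter: map an analog output to a binary level of +0.5 or -0.5."""
    return 0.5 if y >= 0.0 else -0.5

def decodes_correctly(output, target):
    """True when the hard-limited output equals the binary target level."""
    return hard_limit(output) == target

targets = [0.5, -0.5]
errors = [i / 100.0 for i in range(-49, 50)]   # |error| < 0.5, i.e. squared error < 0.25
all_ok = all(decodes_correctly(t + e, t) for t in targets for e in errors)
print(all_ok)

# An error larger than 0.5 in magnitude can flip the sign of the output.
print(decodes_correctly(0.5 - 0.6, 0.5))
```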

The algorithm has been implemented in Turbo C++ 3.0 (Borland) on a 66 MHz Pentium
PC/AT.
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IV.  CONCLUSIONS 


This paper has presented a new algorithm which is faster and more robust than the standard
backpropagation algorithm in training multilayer perceptrons.

The algorithm presented here converges in less than 10% of the time it takes for the backpropagation
algorithm.

Testing performed on 3-layer networks and on networks with more neurons per layer had equally impressive
results. In one experiment (not shown) with 39 neurons in the first and second layers and 1 neuron in the
output layer, the backpropagation algorithm took approximately 20000 iterations to reach the same
mean-squared error that the new algorithm achieved in 200.

Also, the new algorithm is more predictable in its training. In Fig. 3, it can be seen that the backpropagation
algorithm tends to reach a certain mean-squared error and remain there for quite a while, making little or
no progress.

At  some  point,  it  either  rapidly  converges,  or  jumps  to  a  new  level  where  it  would  again  make  little  or 
no  progress  for  quite  a  while.  In  contrast,  the  new  algorithm  continues  to  make  steady  progress  toward 
improving  the  mean-squared  error  throughout  the  training  period. 

Finally,  the  convergence  of  the  backpropagation  algorithm  depends  heavily  on  the  magnitude  of  the 
initial  weights.  If  chosen  incorrectly,  the  algorithm  takes  a  long  time  to  converge.  The  new  algorithm 
seems  to  be  less  sensitive  to  the  initial  weight  setting. 
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Table  I  -  Improvement  Ratio  of  the  New  Algorithm  over  BackPropagation 
with the MSE Convergence Set at 0.25.


Fig.  3  -  Learning  curves  for  a  2  layer  pattern  recognition  network  with  16  nodes 
in  the  hidden  layer.  Seven-by-seven  patterns  were  used  for  training. 


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ABSTRACT: 

This paper reports on computational experience with a quasi-Newton based strategy for the
supervised training of feed forward neural networks. Coupled with a generalized formulation of
the logistic function and explicit box constraints on network variables, the strategy is shown to require up to three
orders of magnitude fewer function evaluations than the delta rule, and up to an order of magnitude
fewer function evaluations than certain conjugate gradient implementations.

INTRODUCTION 

The  ability  of  neural  networks  to  generalize  based  only  on  a  set  of  training  data  has  been 
extensively  documented  in  the  open  literature  in  recent  years  (see  for  example  Rumelhart  1986 
[5]).  Supervised  training  of  a  feedforward  neural  network  is  usually  achieved  through  the  solution 
of an appropriate nonlinear program. Consequently, training times are affected by the nonlinear
programming  algorithms  used.  Some  of  these  algorithms  are  (i)  simulated  annealing,  (ii)  the  delta 
rule,  (iii)  conjugate  gradients,  (iv)  heuristics,  (v)  Kalman  filter  techniques  and  (vi)  Newton-based 
strategies. 

The  overall  goal  of  this  paper  is  to  address  the  training  of  the  feedforward  network  using 
successive  quadratic  programming.  In  the  suggested  framework,  one  can  handle  network  training 
while  incorporating  explicit  bound  (or  box)  constraints  on  key  network  parameters  such  as  weights. 
In  the  oral  presentation,  I  will  also  report  on  the  implementation  of  the  algorithm  on  a  network  of 
workstations  using  Scientific  Computing’s  Linda  parallel  processing  software. 

THE  FEED  FORWARD  NETWORK 

A typical feed forward neural network consists of $s$ layers of neural elements (an input layer,
$s - 2$ hidden layers and one output layer) as illustrated in Figure 1 with $s = 3$. In the $j$-th layer
there are $M_j$ processing elements which are interconnected with elements in the $(j-1)$-th and
$(j+1)$-th layers. Associated with the interconnection between the $k_{j-1}$-th element of layer $(j-1)$
and the $k_j$-th element of layer $j$ is a weighting factor.

A generalized logistic function, $\sigma$, which maps the cumulative input, $X$, to the output, $Y$, of a
processing element is defined as follows:

$Y = \sigma(X) = \{1 + \exp[P(X)]\}^{-1} - C$  (1)

where $P(X)$ is a function of $X$. In this paper, $P(X)$ is restricted to the family of polynomials. Thus

$P(X) = \sum_{i=0}^{m} a_i X^i$  (2)

The choice of $m = 1$, $a_1 = -1$ and $C = 0.5$ corresponds to the more conventional form of the logistic
function. Here $a_0$ corresponds to the bias term. A subset of the $a_i$ together with the network
weights, $W_{ij}$, may be used as training parameters (i.e. design variables).

The choice of $m$ as an odd integer results in a monotonic basis function. In contrast, the choice of
$m$ as an even integer results in a non-monotonic (bell shaped) basis function. The bell shaped function
(from preliminary trials) seems to do better on two dimensional pattern recognition problems in
which points (members) in a class span non-contiguous regions (see for example Example 3). The
properties of the new basis are still being investigated and will be reported fully in another paper.
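The effect of the polynomial order $m$ on the shape of the basis function can be illustrated with a short Python sketch. It assumes the form $Y = \{1 + \exp[P(X)]\}^{-1} - C$ with $P(X) = \sum_i a_i X^i$; the function name `generalized_logistic`, the overflow guard, and the coefficient values used for $m = 2$ are illustrative choices, not taken from the paper.

```python
import math

def generalized_logistic(x, coeffs, c=0.0):
    """Y = 1 / (1 + exp(P(x))) - c, with P(x) = sum_i coeffs[i] * x**i."""
    p = sum(a * x ** i for i, a in enumerate(coeffs))
    if p > 700.0:                       # exp(p) would overflow; the first term is effectively 0
        return -c
    return 1.0 / (1.0 + math.exp(p)) - c

points = (-2.0, 0.0, 2.0)

# m = 1, a1 = -1, C = 0.5: monotonic logistic with outputs in (-0.5, 0.5)
monotonic = [generalized_logistic(x, [0.0, -1.0], c=0.5) for x in points]

# m = 2 (even): bell-shaped, non-monotonic basis; a2 = 1 is an arbitrary choice
bell = [generalized_logistic(x, [0.0, 0.0, 1.0]) for x in points]

print(monotonic)
print(bell)
```

With an odd $m$ the output increases steadily across the three sample points, while with an even $m$ it peaks at the center and falls off symmetrically on both sides.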
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Network  training  is  achieved  by  solving  the  following  nonlinear  program  for  the  optimal  values 
of $Q$, which is a subset of the network parameters $C$, $W$ (weights) and the parameters associated with
$P(X)$. Thus

$\min_Q J = \sum_{n=1}^{N} \sum_{k_s=1}^{M_s} (Y_{n,k_s} - \hat{Y}_{n,k_s})^2$  (3)

subject to

$Y_{n,k_1} = \hat{X}_{n,k_1}$  (4)

$Y_{n,k_j} = \sigma(X_{n,k_j})$  (5)

$X_{n,k_j} = \sum_{k_{j-1}=1}^{M_{j-1}} W_{k_{j-1} k_j} Y_{n,k_{j-1}}$  (6)

$Q_{lower} \le Q \le Q_{upper}$  (7)

where $\hat{X}_{n,k_1}$ is the training data set (input to the $k_1$-th element) and $[Q_{lower}, Q_{upper}]$ are bounds on
$Q$. In addition, the performance objective, $J$, is the sum of squares of the deviation of the network
output, $Y_{n,k_s}$, from the expected or desired output, $\hat{Y}_{n,k_s}$. At the solution of the nonlinear program,
the two outputs are close to each other. There are several variations of the form of $J$. For example,
the root mean square deviation and the weighted residual have been used in the open literature.

The  inclusion  of  the  simple  bounds  on  the  training  parameters  leads  to  better  conditioning  of 
the  training  algorithm  since  the  weights,  for  instance,  do  not  become  excessively  large. 

QUASI-NEWTON  BASED  TRAINING 

Typically a feedforward network is trained using the delta rule (or error backpropagation)
(Rumelhart, 1986 [5]). A different approach is taken in this paper. Here, the analytic expressions
which have been derived for the gradients of the feedforward network (see Appendix) have
been used directly to solve the feedforward training problem as posed in Equation 3. Gradient information
for the training algorithm was obtained in two ways, by adjoints and by perturbation. The
adjoint method was at least an order of magnitude faster than perturbation by central differences.
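The comparison between analytic gradients and perturbation by central differences can be illustrated on a deliberately small case. The sketch below is not the adjoint formulation of the Appendix (which is not reproduced here); it uses a single logistic element with a sum-of-squares objective, and all function names and data values are hypothetical. It also shows why perturbation is slower: central differences need two objective evaluations per parameter, whereas the analytic gradient needs a single sweep over the data.

```python
import math

def sigma(x):
    """Conventional logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def objective(w, data):
    """J = sum over samples of (sigma(w . x) - target)^2 for a single element."""
    return sum((sigma(sum(wi * xi for wi, xi in zip(w, x))) - t) ** 2 for x, t in data)

def analytic_gradient(w, data):
    """Gradient of J by the chain rule, computed in one pass over the data."""
    grad = [0.0] * len(w)
    for x, t in data:
        y = sigma(sum(wi * xi for wi, xi in zip(w, x)))
        common = 2.0 * (y - t) * y * (1.0 - y)   # dJ/ds for this sample
        for i, xi in enumerate(x):
            grad[i] += common * xi
    return grad

def central_difference_gradient(w, data, h=1e-5):
    """Gradient of J by perturbation: 2 * len(w) objective evaluations."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((objective(up, data) - objective(down, data)) / (2.0 * h))
    return grad

samples = [([1.0, 0.5], 0.9), ([1.0, -1.5], 0.1)]
weights = [0.2, -0.4]
print(analytic_gradient(weights, samples))
print(central_difference_gradient(weights, samples))
```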

The  solution  procedure  uses  a  Successive  Quadratic  Programming  algorithm  (Han,  1977  [2], 
Biegler, 1985 [1]) that employs low rank Hessian updating schemes like BFGS. For convenience, I
will  refer  to  the  SQP  training  of  the  feedforward  network  with  the  new  logistic  function  as  SQPN. 
The  training  algorithm  is  illustrated  in  Figure  2. 

TEST  PROBLEMS 

Unfortunately, due to space limitations, only a selection of example problems is presented and
discussed. In the following test problems, a subset of the $a_i$ together with the network weights,
$W_{ij}$, have been used as training parameters (i.e. design variables). In particular, I have consistently
employed $a_0$ as a design variable since it plays the role of the bias. Each simulation was performed
on a SUN SparcStation, an IBM RS/6000 model 320H or model 530H. Unless otherwise stated, the
initial values for the design variables were randomly generated using the time of day as the seed.

Example  1:  Parity  Problem 

In the parity problem (Makram, 1989 [3]), the output is required to respond with a positive
sign when an odd number of the $N$ inputs are +1 and with a negative sign otherwise. Thus there are $2^N$
training sets for the $N$-parity problem. The problem was solved for $N$ = 2, 3, 4, and 5 respectively.
The network consisted of the input layer ($N$ neurons), one hidden layer ($N + 1$ neurons) and an
output layer (1 neuron). For comparison, $m$ was chosen to be 1 and $a_1$ was set to -1. In addition,
the weights and $a_0$ were used as training parameters.
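The $2^N$ training sets of the $N$-parity problem can be enumerated directly. The sketch below assumes that inputs are coded as +1 or -1 and that the target is +1 for an odd number of +1 inputs and -1 otherwise; the function name `parity_training_set` is hypothetical.

```python
from itertools import product

def parity_training_set(n):
    """All 2**n input patterns of +1/-1 with target +1 for an odd number of +1 inputs."""
    patterns = []
    for inputs in product((-1, 1), repeat=n):
        ones = sum(1 for v in inputs if v == 1)
        target = 1 if ones % 2 == 1 else -1
        patterns.append((inputs, target))
    return patterns

for n in (2, 3, 4, 5):
    print(n, len(parity_training_set(n)))
```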

Table 1 summarizes the results of several training runs using $N + 1$ hidden nodes, the same
number used by Makram-Ebeid$^1$ et al. SQPN yields up to five times fewer function evaluations

$^1$The values reported for SQPN are averaged over a few runs from random initial points. BCG is Bounded
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(network  sweeps)  than  the  Bounded  Conjugate  Gradients  approach  of  Makram-Ebeid  et  al.  In 
addition  SQPN  results  in  up  to  2300  times  fewer  function  evaluations  than  their  implementation 
of  the  delta  rule. 

Table 2 compares the use of $a_0$ to that of $a_1$ as design variables. $a_0$ does better than $a_1$ for the
parity examples. This may be due to the greater sensitivity of the logistic function to $a_1$ (the second
term in the polynomial tends to dominate the first term) compared to $a_0$. Thus the latter may lead to
better conditioned training than $a_1$.

Example  2:  Nonlinear  Identification 

This  nonlinear  identification  example  was  proposed  by  Narendra  and  Parthasarathy  [4].  Here 
the  network  is  trained  to  identify  the  following  nonlinear  model  of  a  plant. 


$y_p(k+1) = f[y_p(k), y_p(k-1)] + u(k)$

where $f[y_p(k), y_p(k-1)]$ is the nonlinear function of the example in [4].
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Training  sets  were  made  up  by  taking  one  hundred  samples  of  u(k)  (assuming  that  u{k)  is  i.i.d 
random signal uniformly distributed in the interval [-2, 2]) and evaluating the corresponding $y_p$.
The network was trained to predict $y_p(k+1)$ given $[y_p(k-1), y_p(k)]$ using a 2-20-10-1 network
(i.e. 2 inputs, 2 hidden layers with 20 and 10 elements respectively, and 1 output). Narendra et
al. employed the same architecture and used only weights (total of 250 variables) as the decision
variables. Thus in order to do a fair comparison I used only weights as the decision variables.

I  trained  the  network  by  repeatedly  using  data  from  the  first  100  time  steps.  Thus  the  network 
saw  a  smaller  set  of  training  data  than  was  used  by  Narendra  et  al.  who  trained  the  network 
for  100000  time  steps  in  order  to  obtain  good  prediction.  From  Table  3,  just  over  300  function 
evaluations,  equivalent  to  31100  (311  times  100)  time  steps,  were  needed  to  make  the  objective 
(square  error  =  0.004)  small  enough  for  good  prediction.  The  trained  network  was  validated  by 
letting  it  predict  yp  for  the  next  100  time  steps.  From  Figure  3,  the  prediction  is  very  good. 

Example 3: 2-D Pattern Recognition

In  this  two  dimensional  example  (Shah,  1990  [6])  points  are  classified  as  belonging  to  class  A 
or  B  as  illustrated  in  Figure  4. 

Five  hundred  training  points  were  randomly  generated  for  classes  A  and  B  respectively.  This 
pattern  recognition  problem  is  a  nonlinearly  separable  problem  that  requires  two  hidden  layers 
according  to  Kolmogorov.  The  problem  was  solved  using  two  hidden  layers,  and  ten  elements  in 
each  hidden  layer. 

Shah et al. solved this problem$^2$ using three training methods, namely error backpropagation
(delta rule), the global extended Kalman filter algorithm and the multiple extended Kalman
filter algorithm. They used network output of 0.9 (0.1) for points belonging to class A (B).

In  addition  to  using  [0.9,  0.1]  for  [A,  B],  [0.5,  -0.5]  were  used  for  another  set  of  runs.  Table  4 
(also  Figures  5  and  6)  summarizes  results  from  the  literature  (Shah  et  al.)  and  ours.  Values  had 
to be estimated from the plots given in the literature in order to make specific comparisons. The
multiple  extended  Kalman  filter  algorithm  performed  about  as  well  as  SQPN  at  the  25-iteration 
mark.  No  data  beyond  the  25-iteration  mark  was  provided  in  the  literature. 

The table shows that SQPN reduces the square error twice as much as any of error backpropagation,
the global extended Kalman filter and the multiple extended Kalman filter. The table
and Figures 5 and 6 also show that a polynomial index of 2 (in the logistic function) speeds up
convergence compared to an index of 1 for this particular example. Similarly, the use of [0.5, -0.5]
($C$ = -0.5 in logistic function) to represent membership in region A or B is worse than the use
of [0.9, 0.1] ($C$ = 0 in logistic function). Since -0.5 and 0.5 are asymptotic values of the logistic
function, their use to represent class membership leads to relatively large values of the weights,
thus resulting in numerical problems.


Conjugate Gradients. Makram et al. attained Training Objective values close to 0.01.

$^2$Account is taken of the fact that our objective function expression, $J$, is twice that of Shah et al.


DISCUSSION  AND  FUTURE  WORK 


The quasi-Newton training strategy has been shown to yield significantly fewer function iterations
in the training of the feed forward network than similar strategies described in the open
literature. Since CPU times were not reported in the literature, and since invariably different
computing platforms were employed by various researchers for the simulations, it is impossible to
compare the performance of the algorithms in terms of CPU times. However, the number of function
evaluations is an appropriate and acceptable measure of speed and for all practical purposes
independent of the computing platform.

The adjoint method of evaluating the gradients lends itself to implementation on a parallel
computer.  The  generalized  logistic  function  introduced  in  this  paper  shows  considerable  promise 
in  this  research  effort  into  neural  network  speed  up.  The  properties  of  this  novel  function  are  still 
research  issues  that  are  being  resolved. 

Box  constraints  on  the  decision  variables  were  included  as  explicit  constraints  for  the  following 
reasons: (i) numerical difficulties, for example floating-point underflows due to large network parameters
and  poorly  scaled  training  sets,  are  minimized  and  (ii)  since  they  are  linear,  once  they  are  satisfied  at 
the  first  iteration,  they  will  be  satisfied  for  subsequent  iterations.  The  main  disadvantage  with  using 
explicit  constraints  is  the  large  set  of  constraints  that  may  result.  To  get  around  the  dimensionality 
problem,  logarithmic  barrier  function  and  penalty  function  approaches  lump  these  box  constraints 
with  the  objective  function.  However,  the  log  barrier  approach  will  either  avoid  the  bounds  on  the 
decisions  or  get  very  close  to  the  bounds  at  the  risk  of  introducing  numerical  difficulties.  With 
penalty function approaches, often a trade-off is made between satisfying the box constraints and
minimizing  the  original  objective.  If  a  proper  adaptation  scheme  is  not  chosen  for  the  penalty 
parameter,  then  poor  training  will  result. 

One  main  disadvantage  that  one  can  anticipate  in  SQPN  is  that,  since  storage  of  Hessian 
information  is  required  in  the  SQP  approach,  it  is  expected  that  for  large  networks  (on  the  order 
of  perhaps  1000  weights),  the  quasi-Newton  based  approach  will  not  be  feasible.  Limited  memory 
quasi-Newton methods, as well as conjugate gradients with trust region approaches are being
investigated for such large scale problems. I am also looking into decomposing the Hessian into
reasonably  sized  submatrices. 
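The storage concern can be made concrete with a rough estimate. The sketch below assumes a dense Hessian approximation stored as 8-byte floating-point values, and a limited-memory scheme that keeps a fixed number of correction pairs (two vectors each); the function names and the choice of 10 pairs are illustrative assumptions, not figures from the paper.

```python
def dense_hessian_bytes(n_params, bytes_per_value=8):
    """Storage for a full n x n Hessian approximation."""
    return n_params * n_params * bytes_per_value

def limited_memory_bytes(n_params, n_pairs, bytes_per_value=8):
    """Storage for n_pairs correction pairs of length n_params (two vectors per pair)."""
    return 2 * n_pairs * n_params * bytes_per_value

for n in (250, 1000, 10000):
    print(n, dense_hessian_bytes(n), limited_memory_bytes(n, n_pairs=10))
```

Dense storage grows with the square of the number of weights, while the limited-memory storage grows only linearly, which is the motivation for the alternatives mentioned above. Exploiting symmetry would roughly halve the dense figure without changing its quadratic growth.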

Although not discussed in this paper, analytic expressions have been derived for the Hessian of
the feedforward network. These are being used directly in SQPN to reduce the need to employ low
rank Hessian updating schemes like BFGS.
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APPENDIX 

Proposition 1 The gradient of the performance objective, $J$, with respect to the decision variables is
given  by 
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In  the  interest  of  space  the  proof  will  appear  in  another  paper,  and  can  be  obtained  from 
the author. The gradients obtained above have been compared with calculations via perturbation
(finite differences), and the results agree.
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NEURAL  NETWORKS  SHARING  KNOWLEDGE  AND  EXPERIENCE 
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Abstract 

In this paper a new neural network architecture suitable for building robust neurocontrollers
is proposed and tested. The concept of the new architecture is developed around
the idea of sharing the intelligence of a neural network between knowledge and experience.
The architecture consists of two specialized sub-networks; one provides the knowledge
and the other provides experience. The two sub-networks are connected in a feedforward
configuration to form an intelligent neural network. The advantage of such a neural network
architecture is that the experience sub-network constantly updates the knowledge
sub-network with experience information, creating dynamic intelligence. This allows the
overall neural network to adaptively change its knowledge to accommodate changes in the
environment.


I.  Introduction 

Noise and disturbance are sources of problems in any control system. In
control theory the problem of noise and disturbance has been studied and solutions have been
proposed. One way to suppress noise in a control loop is to insert a filter in the path of such
noise [1]. Adaptive controllers have been successfully designed and used to compensate for
disturbance or minor changes in system dynamics [2,3]. One shortcoming of filtering
the noise and adapting controller parameters is the limited range of operation over which
the scheme is valid. Another shortcoming is the robustness of such a controller. When the
level of noise or disturbance becomes high, the performance of an adaptive controller or
filter is expected to degrade [4].

In the past decade, the use of neural networks has gained interest in the field of control
systems [5,6]. Several neurocontrollers have been designed and implemented [7,8].
The pronounced features of a traditional neurocontroller are robustness and computation
speed. Neurocontrollers provide fast computation due to the fact that they are parallel
processing devices when implemented in hardware. Neurocontrollers are robust in the sense
that partial failure in their structure (processing element or connection weight) does not
necessarily lead to significant degradation in their performance. However, current neural
networks implemented as neurocontrollers are sensitive to noise and disturbance. When the
level of noise or disturbance increases, a neural network is expected to predict the target
with some uncertainty. Therefore building an adaptive neurocontroller capable of on-line
filtering of high levels of noise and disturbance is of interest to control engineers. One purpose
of designing adaptive neurocontrollers is to implement an on-line learning procedure. In
such a training procedure a neural network learns as it predicts. Reinforcement learning
is one way of implementing on-line training of a neural network [9]. In reinforcement
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learning  a  neural  network  learns  from  previous  experience.  However,  several  training  cycles 
may be needed before the network learns new changes in the environment. This makes it
difficult to implement for complex control problems in real time.

Our concept of building an adaptive neurocontroller is based on sharing information
between knowledge (constant intelligence) and experience (adaptive intelligence) neural
networks. In this architecture, first the knowledge is acquired by training a neural network
on the nominal behavior of the system. Then the experience is developed by training
another neural network to learn the amount of adjustment of the knowledge needed when
the system is subjected to noise or disturbance. Finally, the experience network is used to
supplement the knowledge neural network. Hence, the overall neural network intelligence
is always up to date and will not become obsolete as changes (noise or disturbance) take
place in the system.

II.  Biological  Learning 

Intelligent biological systems, e.g., human beings and monkeys, accumulate their intelligence
over periods of time. Sometimes it is difficult to precisely identify the level
of intelligence of a person at a given instant of time. However, the performance of a
person can be observed over a range of actions. The measure of intelligence of a person
becomes more decisive when the complexity of the task assigned to the person increases.
Teaching a person to perform a certain task is known as gaining knowledge. If the knowledge
of a person is enhanced with time, the person is said to be gaining experience. These
biological facts lead to the following conclusions. First, intelligence consists of two components:
knowledge and experience. A natural progression of the two components is that
knowledge comes first, then experience is gained at a later stage. Second, experience
should always supplement and not replace knowledge to maintain intelligence. A certain
amount of knowledge should always be kept constant without alteration. Finally, knowledge
is normally acquired during a learning stage (off-line) and experience is gained through
experimentation (on-line). Figure 1 illustrates our understanding of such biological
learning.


Figure  1.  (a)  Biological  Intelligence,  (b)  Computational  Intelligence 
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III.  Neural  Network  Architecture 


The concept of biological learning is translated into a neural network architecture using
reverse engineering techniques. The proposed neural network architecture is developed
keeping in mind the basic principles of biological learning. It is known that in the sensory motor
control system, for example, a motor signal generated by a motor neuron and transmitted
through the neuron's postsynaptic connection is influenced by another motor signal derived from the
same stimuli as the original neuron and transmitted through an inhibitory interneuron.
The interneuron is connected in a loop with respect to the original neuron. This neuron
exhibits time delay, frequency modulation, or a combination of both. It can influence
the original motor neuron in excitatory or inhibitory mode [10]. The loop interaction between
the two motor neurons is a form of implementation of knowledge and experience in
biological systems. The motor neuron represents knowledge and the interneuron is a representation
of experience. Figure 2 shows the biological interaction between two motor neurons.
This biological interaction between neurons is translated into a computational neuron with
constant trained weights and variable biases. The actual value of the biases for a computational
neuron is determined by another feedforward neuron, similar to that of the biological
system. The feedforward neuron acts on the same input data as the computational neuron
and predicts proper biases in accordance with the current input data.


[Figure 2 labels: afferent neuron innervating extensor muscles; extensor motor neuron]

(Courtesy of Kandel, “Principles of Neural Science”)
Figure 2. Interaction Between Biological Neurons


IV. Adaptive Neural Network

The knowledge-experience computational neuron can be generalized to
a complete neural network architecture. An intelligent neural network with knowledge and
experience components is constructed from two sub-neural networks: a knowledge neural
network surrounded by an experience neural network. First the experience neural network
reads the input data and predicts the proper biases for the knowledge neural network.
The knowledge neural network in turn reads the input data and the set of biases from the
experience network, then predicts the expected output of the intelligent neural network.
The following is a mathematical analysis of the proposed neural network to demonstrate
the interaction between knowledge and experience in a sharing intelligence neural network.

Traditional Neural Network: The performance of a multilayer neural network can be
described in the testing (non-training) mode by the function

$Y(t) = \pi(W \times X(t) + B)$  (1)

where $\pi$ defines a neural network with weight matrices W and bias vectors B. The input
and output of the network are X(t) and Y(t) respectively. For a two-layer neural network
with a unipolar sigmoid function in the first layer and a linear function in the output layer,
equation 1 can be expressed as a Matlab function

$Y(t) = \mathrm{purelin}(W2, \mathrm{logsig}(W1, X(t), B1), B2)$  (2)

where B1 and B2 are constant biases found during the training of the neural network.
Equation 2 can be simplified as

$Y(t) = \mathrm{purelin}(W2, A1(t), B2)$  (3)

where $A1(t) = 1/(1 + e^{-a(t)})$ with

$a(t) = W1 \times X(t) + B1$  (4)

The final output of the neural network is therefore expressed by

$Y(t) = \pi(W2 \times A1(t) + B2)$  (5)

It is clear that the knowledge of the above neural network is stored in the weight matrices
W1 and W2, while the biases B1 and B2 contribute little to that knowledge. In the
following section, we will show how these constant biases can be used
to enhance the knowledge of a neural network.

Knowledge and Experience Neural Network: The knowledge of a neural network can be
enhanced by making the biases of the network dependent on the input data. In this case
equation 4 can be written in the following form

$a(t) = W1 \times X(t) + B1$  (6)

where the bias vector B1 is now a variable vector that can be predicted by the experience neural
network as follows

$B1 = WE1 \times X(t) + BE1$  (7)

where WE1 and BE1 represent the weights and biases of the experience neural network.
Substituting equation 7 into equation 6 results in

$a(t) = (W1 + WE1) \times X(t) + BE1$  (8)

A similar equation can be derived for the final output of the neural network

$Y(t) = \pi(W2 \times A1(t) + WE2 \times X(t) + BE2)$  (9)

Equations 8 and 9 reveal that the intelligence of the neural network has been
distributed between the knowledge of the network (W1, W2) and the experience of the
network (WE1, WE2). Furthermore, equation 9 shows that the experience of the network is
made sensitive to the input data ($WE2 \times X(t)$) so that it can accommodate any change,
such as noise and disturbance, in the outside environment of the network. A disadvantage
of such a neural network is that complete parallelism has been lost because of the
delay time required by the experience neural network to reach steady-state values of the
biases.
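
The substitution of an input-dependent bias into the hidden-layer pre-activation can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the experience network is reduced to one linear map per bias vector (as in the bias equation above), takes the output function to be linear, and uses illustrative layer sizes and random weights.

```python
import numpy as np

def logistic(a):
    """Unipolar sigmoid, A1 = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 5, 3  # illustrative sizes, not those of the paper

# Knowledge network weights (constant after training).
W1 = rng.normal(size=(n_hid, n_in))
W2 = rng.normal(size=(n_out, n_hid))

# Experience network, assumed here to be one linear map per bias vector.
WE1 = rng.normal(size=(n_hid, n_in))
BE1 = rng.normal(size=n_hid)
WE2 = rng.normal(size=(n_out, n_in))
BE2 = rng.normal(size=n_out)

def forward(x):
    """Forward pass with biases predicted from the current input."""
    B1 = WE1 @ x + BE1      # input-dependent hidden bias
    a = W1 @ x + B1         # hidden pre-activation
    A1 = logistic(a)
    B2 = WE2 @ x + BE2      # input-dependent output bias
    y = W2 @ A1 + B2        # linear output layer (assumption)
    return y, a

x = np.array([0.3, -0.7])
y, a = forward(x)

# The combined form: knowledge and experience weights add on the input.
a_combined = (W1 + WE1) @ x + BE1
print(np.allclose(a, a_combined))
```

With a purely linear experience network the two weight matrices simply add, so any extra expressive power would have to come from nonlinear layers in the experience network, such as the hidden layer described in the results section.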


VI.  Results  Presentation 

An intelligent neural network with shared knowledge and experience architecture has
been designed and trained. The knowledge network consists of one hidden layer with five
neurons, two input nodes and 41 output nodes. The experience neural network consists of
one hidden layer with 10 neurons, two input nodes and 46 output nodes. The knowledge
neural network was trained to predict a two-dimensional function of the form
$F(x_1, x_2) = e^{-2(x_1^2 + x_2^2)}$. This function represents a hat-shaped function. The knowledge
network was trained on a clean set of input data and tested on input data corrupted with up to 15% noise.
The first test was done without adding the experience network, keeping the original biases
learned by the knowledge neural net constant. The second test was done by supplying
a different set of biases to the knowledge network at each level of noise. Those biases were
predicted by the experience neural network. Results show no significant difference between
the two tests for low levels of noise. However, the difference in network performance
becomes more significant as the level of noise increases. Figure 3 (a, b, and c) shows the
prediction error of the intelligent neural network for the three cases with only 15% noise.
Figure 3-d shows the sum of squared errors for all levels of noise. An improvement of 100%
is noticed at the 15% level of noise.


VII.  Conclusion 

This study indicates that the intelligence of a neural network can be improved by training
two specialized sub-neural networks and connecting them in a feedforward form. The
proposed neural network architecture is shown to be more robust as the level of noise or
disturbance increases. This could be very useful in control applications such as the design
of neurocontrollers for systems that exhibit a high level of noise or disturbance.


ABSTRACT

Backpropagation feedforward neural networks have been applied to pattern recognition and
classification problems. However, under certain conditions the backpropagation net classifier can
produce non-intuitive, non-robust and unreliable classification results. The backpropagation net is
slower to train and does not easily accommodate new data.

To solve the difficulties mentioned above, an unsupervised/supervised type neural net, namely, the
ART2-BP net, is proposed. The idea is to use a low vigilance parameter in the ART2 net to categorize
input patterns into some classes and then utilize a backpropagation net to recognize patterns in each
class. Advantages of the ART2-BP neural net include (1) improvement of recognition capability, (2)
training convergence enhancement, and (3) ease of adding new data. Theoretical analysis and an example are
given to illustrate these advantages.


INTRODUCTION

Pattern recognition and classification are potentially useful approaches for interpreting data
generated by industrial systems such as chemical, manufacturing, and well testing processes. Possible
applications include sensor data interpretation, model identification and validation. Neural networks,
especially backpropagation networks, have been applied to many pattern recognition problems
including the classification of sonar targets and sensor interpretation.

Application of backpropagation networks to well test model identification in reservoir
engineering has been studied by several researchers. These results have shown that the feedforward
backpropagation network classifier has the ability to learn a set of pressure derivative curves and can
then generalize to new cases of known models. Nevertheless, several difficulties were uncovered when
more models are included in the net decision space and when more training curves are added to the
training set. For example, in our simulation, 16 models and 30 pressure derivative data curves per
model were used for training. It took more than 12 hours on a 486 PC for the backpropagation net to
learn. Moreover, it cannot correctly recognize models with similar features. Furthermore, the
backpropagation net is not robust since it is not easy to add new models.


This work is partially supported by TUPREP at the University of Tulsa, funded by 12 major oil
companies in the world.

In this paper, we propose a novel neural net to remedy the aforementioned difficulties. This
network, called the ART2-BP net, uses an ART2 net to sort a large number of input patterns into
several classes. Then the three-layer BP net associated with each class node in the output layer of the
ART2 net is trained using a 2nd-order backpropagation algorithm for further classification.
Advantages of the ART2-BP net include shorter training time, improved training and classification
abilities, and the capability of easily accommodating new models.

An analysis of the advantages the ART2-BP neural net provides is given in Section II. The learning
algorithm using a conjugate gradient method is derived. A nonlinear mapping technique employed
for better classification will also be shown. A well testing model recognition problem is used for the
performance comparison.


Figure 1. The ART2-BP neural net.

THE ART2-BP NEURAL NET

The ART2-BP neural network is shown in Figure 1. In this architecture, backpropagation nets
are placed directly on the output layer of an ART2 net. First, the top-down weights and bottom-up
weights of ART2 are modified by the training examples. Then the three-layer BP net associated with
each class node is trained using a 2nd-order backpropagation algorithm for further classification. It
offers several advantages over the ART2 net as well as the backpropagation net. Advantages include
(1) improvement of recognition capability, (2) training convergence enhancement, and (3) ease of adding
new data. These will be elucidated later on.

The training-recognition procedure of the ART2-BP net can be described as follows. Input
patterns are clustered into classes through the unsupervised learning process provided by the ART2 layer.
At this stage, coarse classification is carried out such that patterns with similar features are
clustered together. Patterns in each class are then forwarded to the BP layer for fine classification. In
this phase, training is efficient because a faster learning algorithm is employed. Furthermore,
classification is effective since fewer patterns are used.


ART2 is a category learning system that self-organizes a sequence of input patterns into various
recognition categories or classes. For the detailed mechanism of ART2, please refer to reference 7.
The vigilance parameter used in ART2 plays a pivotal role. A high vigilance parameter will enable ART2 to
explicitly distinguish patterns with similar features, but the net will not be able to recognize or classify
patterns corrupted with noise or distorted features. On the other hand, if the vigilance parameter value
is too small, almost all patterns will be categorized as a single class.

The backpropagation algorithm, on the other hand, is a supervised learning method. Though a great
many applications using feedforward neural networks with the backpropagation algorithm have been
reported, several disadvantages were also mentioned. These include slow training, convergence failure
during training, and the inability of the trained neural net to accurately distinguish patterns with similar
features.

To lessen some of the drawbacks of using the delta learning rule, even with the momentum
term, a three-layer feedforward neural network with a conjugate gradient learning algorithm is
employed. Using this method, an efficient learning rate can be selected and the global minimum of the error
surface can be found. Further, the least squared error can be reduced to less than 10^ in a few
iterations. All these will improve the training convergence quality and reduce the training time
dramatically. The updated weights under the conjugate gradient method can be expressed as follows.


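$w_{k+1} = w_k + \alpha_k p_k, \qquad \alpha_k = \frac{p_k^T r_k}{p_k^T H p_k}$

$p_{k+1} = r_{k+1} + \beta_k p_k, \qquad \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$

where $r_k$ is the negative gradient of the error at step k and $\{p_k\}$ is a conjugate basis with respect to the Hessian matrix $H$.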
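
As an illustration of how search directions that are conjugate with respect to the Hessian lead to fast convergence, the sketch below applies the standard linear conjugate gradient iteration to a small quadratic error surface. It is a textbook version under stated assumptions (a symmetric positive-definite Hessian `H` and a quadratic objective), not the authors' training code; a real network error surface is not quadratic, so practical training restarts or re-estimates the directions.

```python
import numpy as np

def conjugate_gradient(H, b, w0, tol=1e-12, max_iter=None):
    """Minimize 0.5 * w^T H w - b^T w for a symmetric positive-definite H."""
    w = np.asarray(w0, dtype=float).copy()
    r = b - H @ w                  # negative gradient (residual)
    p = r.copy()                   # first search direction
    n_iter = max_iter or len(b)
    for _ in range(n_iter):
        rr = r @ r
        if rr < tol:
            break
        Hp = H @ p
        alpha = rr / (p @ Hp)      # exact step length along p
        w = w + alpha * p
        r = r - alpha * Hp
        beta = (r @ r) / rr        # keeps the new direction H-conjugate to the old
        p = r + beta * p
    return w

# Hypothetical 2-weight quadratic error surface.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
w_min = conjugate_gradient(H, b, np.zeros(2))
print(w_min)
```

For an n-dimensional quadratic the iteration reaches the minimum in at most n steps in exact arithmetic, which is the property that motivates its use in place of a fixed learning rate.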

Advantages of this ART2-BP neural net are analyzed in the following.

1. Recognition capability improvement

It is known that all the input patterns, exemplar patterns and synaptic vectors can be
normalized and mapped onto a unit sphere in $R^n$. After applying a nonlinear mapping scheme developed
by Sammon, those normalized vectors are mapped from an $R^n$ space to an $R^2$ space as shown in
Figure 2. This nonlinear mapping has the characteristic of preserving the inherent structure of the data.

In this figure, $D_1, \ldots, D_5$ denote classes categorized by ART2, and $m_1, m_2, \ldots, m_5$ are the
centroids of $D_1, D_2, \ldots, D_5$, respectively. x denotes the test pattern and $\theta$ is the angle between
the test pattern and the centroid of Class $D_3$. Note that $D_j \subset R^n$ and $R^n = \bigcup_j D_j$, with
$D_i \cap D_j = \emptyset$ if $i \neq j$. By analyzing the competitive learning mechanism in ART2, we find that
the winning synaptic vector $m_j$ of ART2 equals the centroid of the class $D_j$. This means that $m_j$ defines
the deterministic center of mass of the class $D_j$:

$m_j = \frac{\int_{D_j} x\,p(x)\,dx}{\int_{D_j} p(x)\,dx}$

where p(x) represents a probability distribution of the pattern x.

Fig. 2. Centroids of classes in ART2 net.    Fig. 4. Ambiguous windows.

The range of this class is bounded by [x^ , x^]. These bounds depend on the vigilance parameter
used in ART2. A small vigilance parameter increases the range of the class. In other words, more
patterns will be clustered into the same class if a smaller vigilance parameter value is employed in
ART2.
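
The centroid of a class and the angle between a test pattern and that centroid can be illustrated with a short numerical sketch. It assumes that the center-of-mass integral is estimated by the sample mean of the patterns in the class and that all vectors are normalized to unit length; the pattern values are hypothetical, and this is not an implementation of ART2 itself.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical patterns already assigned to one class, normalized to the unit sphere.
patterns = np.array([
    [1.0, 0.2, 0.1],
    [0.9, 0.3, 0.0],
    [1.0, 0.1, 0.2],
])
patterns = np.array([unit(row) for row in patterns])

# Sample estimate of the class center of mass.
centroid = patterns.mean(axis=0)

# Angle between a (hypothetical) test pattern and the class centroid.
x = unit(np.array([0.8, 0.4, 0.1]))
cos_theta = float(x @ unit(centroid))
theta = float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
print(theta)
```

A smaller angle means the test pattern lies closer to the class centroid on the unit sphere; how that closeness is compared with the vigilance parameter is specific to ART2 and is not modeled here.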

At this stage, we apply a backpropagation net to classify patterns categorized in the same class.
Suppose there are six patterns $t_1, t_2, t_3, \ldots, t_6$ in Class $D_j$, which has centroid $m_j$ as shown in Fig. 3.
Now we assign each pattern a specific vector in the output layer of the BP net. For example,
[0.9, 0.1, 0.1, 0.1, 0.1, 0.1] is used to represent $t_1$ and [0.1, 0.9, 0.1, 0.1, 0.1, 0.1] is for $t_2$, etc. By using
Sammon's nonlinear mapping algorithm, $t_1, t_2, t_3, \ldots, t_6$ will be located around a circle as depicted in
Fig. 3(b). This means that these six patterns can be easily classified by the BP type neural net in that
an explicit boundary can be formed.
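
The target assignment described above can be generated programmatically. The helper below is a hypothetical illustration (the names `soft_targets` and `decode` are not from the paper): it builds one target vector per pattern with 0.9 in the pattern's own position and 0.1 elsewhere, and decodes an activation vector by taking the largest component, which is one common convention rather than the authors' stated rule.

```python
import numpy as np

def soft_targets(n_patterns, on=0.9, off=0.1):
    """Return an (n_patterns x n_patterns) matrix of BP output targets."""
    targets = np.full((n_patterns, n_patterns), off)
    np.fill_diagonal(targets, on)
    return targets

def decode(activation):
    """Index of the largest output activation (assumed decision rule)."""
    return int(np.argmax(activation))

targets = soft_targets(6)
print(targets[0])   # target vector for the first pattern
print(targets[1])   # target vector for the second pattern
```

Using 0.9 and 0.1 rather than 1 and 0 keeps the targets inside the range a sigmoid output can actually reach.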


Fig. 3. Six patterns in Class Dj are mapped into explicitly classified patterns by Sammon's technique.


As elucidated above from a theoretical viewpoint, the ART2-BP net would significantly
improve the recognition performance. Figure 4 shows the case when two patterns are intertwined. The
proposed ART2-BP neural net can be trained to classify these two patterns, but the ART2 net or
the backpropagation net alone cannot.

2. Training convergence enhancement

After clustering by ART2, fewer patterns will be located in the same class. Thus, when
applying the BP net to each class, training time is dramatically reduced. Since explicit decision
surfaces can be found, problems caused by premature convergence and convergence to local minima
are diminished. Therefore, in the recalling phase, the trained BP net can produce satisfactory
results.

3. Easy to add new data

When a new exemplar pattern is added, it will be categorized by ART2 either into one of the existing
classes or into a new class. Then only the class containing this new pattern needs to be retrained. This
feature is very important from the extendibility viewpoint. On the contrary, if only the BP net is
applied, it has to be retrained using the whole set of (old and new) patterns. This is very time consuming and
may cause some convergence problems. Furthermore, the larger the number of patterns, the worse the
performance of the ART2 or the BP net. From this point, we may claim that the ART2-BP net can
handle much more data than the ART2 net or the BP net does.

A  Well  Testing  Model  Recognition  Problem 

In a pressure transient test, a signal of pressure vs. time is recorded. This signal is plotted as
derivative curves which are used in the interpretation process. The signal on these curves is usually
deformed and shifted by some underlying mechanisms in the formation and the wellbore. Since more
than one interpretation model can produce the same signal, this approach can lead to misleading
results. Thus, correctly identifying the models from the signatures present on the derivative plot is of
great importance.


Fig. 5. Software implementation.


Fig. 6. Pressure curves for training (horizontal axis: normalized time).


A software package in C++ has been developed in a PC-486 environment. This program is
interactive and user-friendly. Not only can it identify the model reflected by given data, it also provides
an initial guess of reservoir properties as the input to an analysis program. A schematic diagram for a
complete recognition process is shown in Figure 5. Time-dependent pressure data collected from the
hardware module are fed into the computer for recognition. The ART2-BP neural net is implemented
in the Model Identification Module. After the model is identified, regression techniques are utilized for
parameter estimation. In the Confidence Interval Module, statistical characteristics of the identified
model are calculated to verify the results.

For the well testing model recognition problem considered here, ten pressure curves shown in
Fig. 6 were used. They belong to three different models with various skin factors. The three models
are the homogeneous reservoir and the dual porosity reservoir (slab model and Warren & Root model). Each
pressure derivative curve in the training set was normalized and then sampled at 12 points as the input
pattern. Note that the same normalization method must be used for all curves, including the training
curves and the unknown test curves, to avoid curves being shifted, enlarged or reduced. A vigilance
parameter of value 0.9 was used in ART2.

This ten-pattern data set was used to train the BP net and the ART2-BP net. The 12 x 24 x 10
backpropagation net did not converge after more than ten thousand iterations. This problem is also
encountered in the work of Al-Kaabi and Lee. Nevertheless, the ART2-BP net did learn and
successfully identify those curves. ART2 categorized the first five curves in Fig. 6 into Class 1 and the
other five into Class 2. Then a smaller 12 x 24 x 5 BP net was utilized in each class. Three curves
were used for recognition. The test results are shown in Table I. The first row shows that the test
curve was recognized as a homogeneous reservoir with skin factor s equal to either 0 or 5.

TABLE I  Test Results

Test curve            Output of ART2 (Class)   Output of BP (Activation)
Homogeneous, s=3      1                        0.010  0.822  0.944  0.002  0.003
Slab, s=3             1                        0.002  0.002  0.317  0.322  0.705
Warren & Root, s=3    2                        0.023  0.035  0.345  0.654  0.002

CONCLUSIONS 

This paper has presented a new approach based on ART2 and BP neural nets to identify the well
test interpretation model automatically from the pressure derivative curves. The ART2-BP net has
better recognition capability and easily accommodates new models. Moreover, we have demonstrated
from a theoretical viewpoint that the limitations of the backpropagation network can be relaxed.


Abstract:

The present paper consists of two parts. In the first part, a unified geometrical interpretation of the behavior of
multilayer feedforward networks (MFN) is presented. The inputs and outputs of training samples can be
represented as two separate matrices, and an MFN is a mechanism that successively transforms the input matrix into the
desired output matrix, via the intermediate matrix (or matrices) associated with the hidden layer(s). By thinking of the
matrix as a multidimensional pliable object, the successive matrix transformation can then be compared to a
sequence of stretchings and squeezings of the imaginary object. This interpretation holds for both binary and
continuous function mappings, as well as for mappings where both input and output space are multidimensional, i.e., not
limited to n-to-1 mappings. More importantly, this interpretation provides a whole new perspective on several
important yet still unanswered questions about MFN. In particular, the generalization capability of MFN seems
to be the result of a certain symmetry within the underlying mapping function. In the second part, implications of the
interpretation will be elaborated specifically with regard to quantifying mapping nonlinearity. Novel schemes will
be suggested to quantify the mapping nonlinearity based upon the spatial characteristics of training samples, and to
provide guidelines for avoiding hard-learning situations by reducing the mapping nonlinearity of training samples.
Illustrative examples and results of numerical experiments are presented in support of the interpretation concepts.

1.  Introduction 

Multilayer feedforward networks (MFN) have been the most widely explored of all neural network paradigms. With
regard to explaining the behavior of MFN, previous publications such as [Nilsson], [Lippmann] and [Pao]
have provided primitive interpretations in terms of decision boundaries and decision regions from a mapping
perspective. However, these interpretations are only good for classification problems, where the mappings
are n-to-1 and have binary or discrete outputs. In this paper, a novel and unified interpretation of MFN is presented
to assign a more physical sense to the MFN. This interpretation is valid not only for both binary and continuous
function mappings, but also for mappings where both input and output space are multidimensional (not limited to
n-to-1 mappings).

The inputs and outputs of training samples can be represented as two separate matrices, corresponding to the input and
output states of an imaginary multidimensional pliable object. Geometrically, MFN can be thought of as a mechanism
that successively reshapes such an imaginary training object from its input state to its output state, via the intermediate
state(s) associated with the hidden layer(s). The weight matrix (including the activation function) between two
adjacent layers corresponds to a reshape operation, which consists of two basic actions, stretching and squeezing.
Mathematically, stretching corresponds to the dot product, and squeezing corresponds to the processing through an activation
function. Each hidden and output node is associated with a weight vector that corresponds to a column vector in
the associated weight matrix. The weight vector indicates the direction of stretching and squeezing, as well as the
strength of stretching. Subsequently, the training of an MFN can be thought of as the process of finding an appropriate way
to successively reshape the training object from its input state to its desired output state. On the other hand, the
prediction behavior of MFN can be thought of as the result of a successive reshaping of the continuum training object.

This interpretation may shed light on several important yet unanswered questions in MFN, such as what constitutes a
hard learning case, why MFN can generalize, and what is hidden in the hidden layers. This paper does not intend to
tackle the problem of deciding the number of hidden nodes/layers; instead, it suggests certain guidelines for the
preprocessing of raw training samples. The distribution angle and maximum distribution gradient are introduced to gauge the
mapping nonlinearity. Prior to the training, one may preprocess the raw samples so that the mapping nonlinearity is
reduced in order to facilitate the training.

The paper is organized into two parts. Part I focuses on the unified interpretation of MFN, with an illustrative example
and discussions about the implications of the interpretation. Novel schemes for quantifying mapping nonlinearity will
be presented separately in Part II.

2. MFN: Successive Matrix Transformation

MFN is often explained in biological terms such as neurons, axons, synapses, etc., because of its analogy to the neural system of
living beings. Mathematical descriptions can be found in standard texts on the subject [Lippmann, Kosko, Pao]. Here,
MFN is first described from the perspective of matrix transformations and then associated with specific interpretations.
MFN is known to be able to approximate any mapping function, represented by a set of mapping samples (also called
training patterns), to any degree of accuracy [Hornik]. Let the mapping function be denoted as $f: R^m \to R^n$, and the
samples given as $(I_i, O_i)$, where $i = 1, \ldots, p$, $I_i \in R^m$, and $O_i \in R^n$. (In practice, the input and output of samples are
usually scaled to either binary hypercubes $[0,1]^m/[0,1]^n$, or bipolar hypercubes $[-1,1]^m/[-1,1]^n$, rather than being
used with infinite spaces.) The input and output parts of samples can be represented as two separate matrices, denoted
as $[I]_{p \times m}$ and $[O]_{p \times n}$ respectively, where p is the number of samples, and m and n are the dimensions of the input and output
spaces. Suppose a single hidden layer MFN is trained with samples $(I_i, O_i)$, $i = 1, \ldots, p$. It is clear that when an input vector
$I_i$ is presented to this MFN, $I_i$ is first transformed into an intermediate vector $H_i$ at the hidden layer, and then into the
output vector $O_i$. Similarly, when the entire input matrix $[I]_{p \times m}$ is presented to the MFN, $[I]_{p \times m}$ is first transformed
into the intermediate matrix $[H]_{p \times q}$ (where q is the number of hidden nodes), and then into the output matrix $[O]_{p \times n}$.
Hence, an MFN is, in fact, a successive matrix transformation mechanism.
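
The matrix view can be made concrete with a few lines of array code. This is a minimal sketch under stated assumptions: a sigmoid activation, no bias nodes, random (untrained) weights, and illustrative sizes p, m, q and n chosen only to show the shapes.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
p, m, q, n = 4, 3, 5, 2                 # samples, inputs, hidden nodes, outputs

I_mat = rng.uniform(0.0, 1.0, size=(p, m))   # input state matrix [I], p x m
W_ih = rng.normal(size=(m, q))               # input-to-hidden weights
W_ho = rng.normal(size=(q, n))               # hidden-to-output weights

H_mat = sigmoid(I_mat @ W_ih)           # intermediate state matrix [H], p x q
O_mat = sigmoid(H_mat @ W_ho)           # output state matrix [O], p x n

# Presenting one sample alone gives the corresponding row of [O].
O_first = sigmoid(sigmoid(I_mat[0] @ W_ih) @ W_ho)
print(H_mat.shape, O_mat.shape)
```

Processing the whole matrix at once and processing each sample separately give the same rows, which is why the network can be read as a transformation of the entire training object rather than of individual vectors.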

The successive matrix transformation in an MFN can be compared to "playing dough" in hyperspace. Each layer in an
MFN can be associated with a multidimensional space whose dimension equals the number of nodes in the layer.
For a specific layer, a group of vectors (i.e., a matrix) can be thought of as a group of points in the associated
multidimensional space. One can imagine the whole group of points as a multidimensional object which is constructed by
connecting every point with every other point using a rubber stick. The rubber sticks allow the imaginary object to be
arbitrarily pliable. In this manner, the successive matrix transformation can be thought of as a successive reshaping process
of the imaginary pliable object. Mathematically, a matrix with dimension p×q represents a specific state of the
imaginary pliable object, where p is the total number of vectors and q is the space dimension (i.e., the number of nodes in the
layer). In this research, such an object is referred to as the training object, and the matrix is referred to as the state matrix of the
training object. Hence, in the MFN example above, the three matrices [I], [H] and [O] represent the three states, input,
intermediate and output, of the training object. The reshaping operation can be decomposed into two basic actions,
stretching and squeezing, which can be mathematically associated with the dot product and the processing through an
activation function, as elaborated next.

3.  Stretching  and  Squeezing 

The principle of the matrix transformation (or reshaping) between any two adjacent layers is identical, regardless of
whether a layer is an input, output or hidden one. Therefore, in Figure 1, two adjacent layers are used to explain stretching
and squeezing. The two layers are referred to as the source layer, SL, and the target layer, TL, instead of input/output
layer. Suppose there are p samples, and the numbers of nodes in SL and TL are s and t respectively. The matrices $[S]_{p \times s}$
and $[T]_{p \times t}$ then denote the source and target states of the training object. Note that in practice, it is preferable to add a
bias node in SL to help the training, as explained later. Hence, $[S]_{p \times s}$ is augmented as $[S]_{p \times (s+1)}$, in which the last
column is a unity vector. Accordingly, the weight matrix between SL and TL is denoted as $[W]_{(s+1) \times t}$.

Now one may look into the function of a target node. In Figure 1, the target node $N_t$ is associated with a weight vector
$W_t$, i.e., a column vector of the weight matrix. Two mathematical operations are performed inside a target node: first, the
dot product between the associated weight vector and an input source vector, $W_t \cdot S_p$, and second, the processing
of $W_t \cdot S_p$ through an activation function. When a state matrix of a training object is presented to a target node, the effect
is equivalent to stretching and then squeezing the object along the direction of the weight vector associated with the
target node. The length of the weight vector is equivalent to the magnitude of stretching.

Figure 1. Operations between two adjacent layers: SL (source layer, s+1 nodes) and TL (target layer, t nodes); each target node applies a stretch followed by a squeeze.

Dot  Product  as  Stretching 

Conceptually, applying a dot product between a weight vector and each point vector on the pliable object is equivalent to stretching the object along the direction of the weight vector. This is explained in Figure 2, where there are a weight vector $W_t$, a point vector $S_p$, and an imaginary neutral plane that is perpendicular to $W_t$ and passes through the origin. The dashed line marks the location of the neutral plane, $d$ is the distance from $S_p$ to the neutral plane (i.e., the projection of $S_p$ on $W_t$), and $\theta$ is the angle between $S_p$ and $W_t$. Then

$$d = |S_p| \cos\theta.$$

On the other hand,

$$W_t \cdot S_p = |W_t|\,|S_p| \cos\theta = |W_t|\, d.$$

Therefore, after applying a dot product between a weight vector $W_t$ and every point vector on the training object, the distance from every point to the neutral plane is scaled up by $|W_t|$. In other words, all points are stretched away from the neutral plane (on both sides of the plane), except for those points on the neutral plane, which are unaffected. Note that $|W_t|$ is often greater than 1, which results in the stretching effect. In case $|W_t|$ equals 1, the projected lengths do not change. If $|W_t|$ is less than 1, the projected length actually shrinks. In practice, if $|W_t|$ is relatively small (e.g., < 1), one may conclude that the associated node is redundant and can be removed. (Trimming redundant hidden nodes is a separate subject and is not pursued in this research.)

To be more precise, the dot product between a weight vector $W_t$ and every point vector on the training object essentially converts every point vector into a value that equals the projected length of the point vector on $W_t$, enlarged by $|W_t|$. The multiplication between a state matrix, $[S]_{p \times (s+1)}$, and a weight matrix, $[W]_{(s+1) \times t}$, can then be regarded as the process that converts each s-dimensional point $v_{si}$ into a t-dimensional point $v_{ti}$ via the dot products between each $v_{si}$ and the t column vectors in $[W]$. Subsequently, the training object can be imagined as being stretched from an s-dimensional one to a t-dimensional one, by stretching the object along t directions at one time. Such a simultaneous stretch is referred to as hyperstretch, which is different from an intuitive stretch because of the change of object dimensionality; one might need some imagination to perceive the hyperstretch. However, intuitively one can perceive that the hyperstretch can result in a nonlinear distortion of the topological appearance of the training object, because of the nonlinear change of the relative distances among training points.
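
The stretching relation above can be checked numerically. The following is a minimal sketch, assuming small hypothetical matrices (the points in `S` and the weight vectors in `W` are illustrative values, not taken from the paper); it verifies that each entry of the stretched state equals the projected length of the point on the weight vector, enlarged by the length of that weight vector.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def projected_length(point, weight):
    """Signed distance d from `point` to the neutral plane perpendicular to `weight`."""
    return dot(point, weight) / norm(weight)

def stretch(points, weights):
    """Hyperstretch: each point is converted into its dot products with every weight vector."""
    return [[dot(p, w) for w in weights] for p in points]

# Hypothetical source state: four 2D points with a bias coordinate of 1 appended.
S = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
# Hypothetical weight vectors (the columns of the weight matrix), one per target node.
W = [[3.0, 4.0, 0.0], [1.0, -2.0, 2.0]]

stretched = stretch(S, W)
for p, row in zip(S, stretched):
    for w, value in zip(W, row):
        # W_t . S_p = |W_t| d for every point and every weight vector
        assert math.isclose(value, norm(w) * projected_length(p, w), abs_tol=1e-12)
print(stretched)
```

Each row of `stretched` is one point of the object after the hyperstretch; the object has moved from a 3-dimensional source space (including the bias coordinate) to a 2-dimensional target space.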

The function of the bias node in SL now becomes clearer. Adding a bias node at SL is equivalent to placing the object in a space of one additional dimension, where the stretch is made. For an s-dimensional object, the stretch can be made in (s+1) dimensions (e.g., a 2D object is stretched in a 3D space). It is intuitively perceivable that given the extra dimension for stretching, it should be easier to achieve the desired distortions of the training object.

Activation  Function  as  Squeezing 

Figure 3 shows a typical sigmoid activation function, $y = 1/(1 + e^{-x/T})$, where $T$ is a constant (referred to as temperature). The activation function is also called a threshold or squash function. This function converts a 1D infinite space into a unit region [0, 1], and the origin is mapped into the center (i.e., 0.5). In Figure 3, the region marked by [-x, x] can be defined as an effective region. The space inside the effective region is nonlinearly normalized to (0, 1), while the space outside the region is squashed nearly to the boundary of the region. For the activation function with $T = 1$, the effective region is approximately [-4.5, 4.5], which maps to [0.01, 0.99]. Note that the activation function only accepts a scalar input. When it is applied to a state matrix of a training object, it is applied to each element of the matrix. Therefore an n-dimensional space is squeezed into an n-dimensional hypercube $[0, 1]^n$.
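
The squeezing operation can be sketched in a few lines. This is an illustrative sketch only, assuming the standard temperature-scaled sigmoid; the function names `squeeze` and `hypersqueeze` are hypothetical labels chosen here. It checks that the origin maps to 0.5 and that the edges of the effective region for T = 1 map close to 0.01 and 0.99.

```python
import math

def squeeze(x, temperature=1.0):
    """Sigmoid y = 1 / (1 + exp(-x / T)), written to avoid overflow for large |x|."""
    z = x / temperature
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hypersqueeze(matrix, temperature=1.0):
    """Apply the scalar squeeze to every element of a state matrix."""
    return [[squeeze(v, temperature) for v in row] for row in matrix]

center = squeeze(0.0)      # the origin maps to the center of [0, 1]
low = squeeze(-4.5)        # lower edge of the effective region for T = 1
high = squeeze(4.5)        # upper edge of the effective region for T = 1
print(round(center, 3), round(low, 3), round(high, 3))

# A hypothetical 2 x 2 state matrix squeezed element by element into [0, 1]^2.
print(hypersqueeze([[-6.0, 0.0], [2.0, 6.0]]))
```

Because the input is divided by T, doubling the temperature doubles the width of the effective region: `squeeze(9.0, temperature=2.0)` returns the same value as `squeeze(4.5)`.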

Schematic examples are helpful in visualizing the multidimensional squeezing effect. Figure 4 shows a 2D example, where three arbitrary objects, A, B and C, are squeezed to a, b and c respectively. Note the difference in scales as well as the nonlinear change of grid locations. The entire space (only the region $[-6, 6]^2$ is shown) is squeezed into $[0, 1]^2$; the origin (0, 0) is mapped into the new center, (0.5, 0.5). The effective region in this case is approximately $[-4.5, 4.5]^2$. One sees that the space distortion increases sharply with respect to the distance from the origin, and gets saturated approximately around the boundary of the effective region. For example, object B has the least distortion, and object A has less distortion for the portion closer to the origin. For objects A and C, the portions lying outside the effective region are distorted dramatically, and are squashed to near the boundary of the effective region. Specifically, comparing A with a, it seems that the portion of A lying outside the effective region is chopped off, although none of the points on A are actually lost. Such a pseudo-chopping may help to perceive the change of topological feature when an overstretched multidimensional object is squeezed (overstretched pertains to those portions of an object that are stretched to outside of the effective region).

Figure 2. Dot product as stretching.

Figure 3. A sigmoid activation function.

Figure 4. 2D squeezing example. The objects A, B and C are squeezed to a, b and c respectively. Note the difference in scales and the nonlinear change of grids.


As a scalar operation, the activation function really needs no direction to apply. In this paper, squeezing is associated with the direction of stretching because conceptually the squeezing effect occurs exactly in the same direction as the stretching. Similar to the concept of hyperstretching, a hypersqueezing refers to a set of squeezings associated with the multiple directions in a weight matrix.


4.  Illustrative  Example 

The well-known 2-bit xor problem is used to illustrate the successive hyperstretchings and hypersqueezings. The problem is to approximate the mapping where the four points, (0,0), (0,1), (1,0) and (1,1), map to 0, 1, 1, and 0 respectively. These four samples, labeled as p1, p2, p3 and p4, are trained by an MFN with two hidden nodes organized in a single hidden layer. Figure 5 shows the various state matrices of the training object, $M_a$, $M_b$, $M_c$, $M_d$ and $M_e$, together with the two weight matrices associated with the hidden layer and output layer, $[W]^{(1)}$ and $[W]^{(2)}$. $M_a$ is the input state matrix, where the last column is a unity vector associated with the bias node. $M_b$ results from the hyperstretching of $M_a$ by $[W]^{(1)}$. $M_c$ is obtained by squeezing $M_b$ (i.e., each component in $M_b$ is processed through the activation function), and including a unity column vector for the bias node. Next, the stretching by $[W]^{(2)}$ is applied to $M_c$ and yields $M_d$. Finally, $M_d$ is squeezed to $M_e$, which is the predicted output state and is very close to the desired output, $M_f$.
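
The sequence of state matrices can be reproduced with a short sketch. The weights below are hand-picked for illustration and are an assumption of this example; they are not the trained values shown in Figure 5, so the intermediate states differ from those in the figure. The names `M_a` to `M_e` follow the state labels used above.

```python
import math

def sigmoid(x):
    """Activation function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def stretch(state, weights):
    """Multiply a state matrix (rows = samples) by a weight matrix (rows = source nodes)."""
    cols = len(weights[0])
    return [[sum(row[k] * weights[k][j] for k in range(len(row))) for j in range(cols)]
            for row in state]

def squeeze(state):
    """Apply the activation function to every element of a state matrix."""
    return [[sigmoid(v) for v in row] for row in state]

def add_bias(state):
    """Append a unity column for the bias node."""
    return [row + [1.0] for row in state]

# Hand-picked illustrative weights (NOT the trained values from Figure 5).
W1 = [[20.0, -20.0],
      [20.0, -20.0],
      [-10.0, 30.0]]               # (2 inputs + bias) x 2 hidden nodes
W2 = [[20.0], [20.0], [-30.0]]     # (2 hidden nodes + bias) x 1 output node

M_a = add_bias([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # input state
M_b = stretch(M_a, W1)             # after the first hyperstretching
M_c = add_bias(squeeze(M_b))       # after the first hypersqueezing, bias appended
M_d = stretch(M_c, W2)             # after the second stretching
M_e = squeeze(M_d)                 # predicted output state

predicted = [round(row[0]) for row in M_e]
print(predicted)
```

With these particular weights the two samples p2 and p3 are mapped to the same point in `M_b`, which is one of many possible ways to make the xor mapping achievable by the final stretching and squeezing.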

As a counterpart of Figure 5, Figure 6 shows the successive reshaping process of the training object. The object is marked by p1, p2, p3, p4, and dashed lines. The weight vectors applied to stretch the object are shown in $M_a$ and $M_c$. (The weight vectors are not shown with exact scale; rather they show the approximate application directions.) Figure 6-$M_a$ is a 3D view of the training object, where one can visualize the effect of the bias node: the 2D object is placed in the space with an extra dimension and shifted to the unity location in that dimension. Figures 6-$M_b$ and 6-$M_c$ show the 2D view of the training object, where p1 and p4 nearly coincide with each other; in 6-$M_c$, the third dimension is not shown and corresponds to the axis that is perpendicular to the figure plane and passes through the origin. In 6-$M_d$ and 6-$M_e$, the object is shown on a straight line, for the object has been squeezed to a 1D space. One can observe that the topological appearance of the training object evolves gradually from the input state to the output state. Obviously, visualization of the successive reshaping process would be difficult, if not impractical, when the object dimension is greater than three.

Figure 5. Successive matrix transformation of the 2-bit xor problem. $M_a$: input state; $M_b$, $M_c$ and $M_d$: intermediate states after a stretching, a squeezing, and another stretching; $M_e$: predicted output state; $M_f$: desired output state; $[W]^{(1)}$ and $[W]^{(2)}$: weight matrices for hidden layer and output layer.

Figure 6. Successive reshaping of the training object of the 2-bit xor problem. The training object is associated with the state matrices $M_a$, $M_b$, $M_c$, $M_d$ and $M_e$ in Figure 5. The weight vectors associated with $[W]^{(1)}$ and $[W]^{(2)}$ are shown in states $M_a$ and $M_c$.


5.  Implications  of  the  Interpretation 

Although the illustrated xor problem is an n-to-1 binary mapping function, the interpretation also holds for mappings that are continuous and multidimensional (i.e., n-to-m). This is obvious since, firstly, the intermediate state matrices actually possess real numbers even for binary mappings, and secondly, the interpretation is independent of the dimension of the state matrix. With this interpretation, the training of MFN can be interpreted as the process of finding out how to reshape the training object from the input state to the desired output state, i.e., to decide the directions, strengths, and perhaps the repetition times, of hyperstretchings and hypersqueezings. On the other hand, the use of MFN for prediction can be thought of as the result of successive reshaping: a novel input is marked in the continuous input space, and when the entire input space is reshaped to the output state, the marked location represents the predicted output.

From Figure 6, one can observe that during the sequence of reshaping, the topological appearance of the training object is evolving toward that of the desired state. The evolving principle is that after each reshaping operation, those points that are close to (or far away from) each other in the desired output state tend to be pulled closer to (or pushed further away from) each other. This observation will be elaborated in the next section, where novel schemes for quantifying mapping nonlinearity are presented based on this observation. Discussed below are a few more implications of the interpretation.

Generalization  with  MFN 

The property of generalization in MFN has been particularly elusive. Mathematically, generalization is different from interpolation in that interpolation always predicts an outcome based upon the neighboring samples, while generalization allows the prediction to be made on a basis beyond the neighboring samples. Take the n-bit xor problem as an example. Interpolation tends to predict a wrong outcome due to the fact that in the input space, each point always has an output value that is opposite to that of its immediately adjacent points. Hence if the prediction is based on neighboring samples, it tends to yield an outcome that is similar to that of its neighboring points, i.e., a false one. On the other hand, it is well known that the MFN can predict satisfactorily with such problems: it can overpass the neighboring barrier to achieve the generalization.

This property can be explained from the viewpoint of reshaping the training object. For the n-bit xor problem, if one marks each training point at the input state with the corresponding output value (either 0 or 1), the mapping function must exhibit certain symmetry. Due to this symmetry, those regions with an insufficient number of training points will also be reshaped in a correct manner, and hence a novel input may get mapped into a correct output through the symmetrical relationship with respect to the weight vectors. In other words, it is this symmetry that makes the generalization possible. Should the input state training points be insufficient to characterize the symmetry (if such symmetry exists) of the underlying mapping function, generalization would not be possible. On the other hand, should the underlying mapping function possess no symmetry, the generalization would be meaningless and the MFN would behave like a regular function approximator.
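
The claim that neighbor-based interpolation fails on the n-bit xor (parity) problem can be illustrated with a small experiment. This is a sketch under stated assumptions: interpolation is modeled here as a majority vote among the nearest training samples in Hamming distance, which is a stand-in chosen for this example and is not a method described in the paper.

```python
from itertools import product

def parity(bits):
    """n-bit xor: 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2

def hamming(a, b):
    """Number of positions in which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbour_vote(query, samples):
    """Majority label among the samples at minimum Hamming distance from `query`."""
    best = min(hamming(query, s) for s in samples)
    votes = [parity(s) for s in samples if hamming(query, s) == best]
    return int(2 * sum(votes) > len(votes))

n = 4
points = list(product((0, 1), repeat=n))
errors = 0
for held_out in points:
    # Train on every point except the one being predicted.
    training = [p for p in points if p != held_out]
    if nearest_neighbour_vote(held_out, training) != parity(held_out):
        errors += 1
print(errors, "of", len(points), "held-out points mispredicted")
```

Every held-out point is mispredicted, because all of its nearest neighbors differ from it in exactly one bit and therefore carry the opposite output value.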

Uniqueness  of  Weight  Vectors 

Another frequently asked question is "For a trained MFN, is the set of weight vectors unique? If not, what is the relationship between the feasible solutions?" The first question can be restated as "Is the set of intermediate states of a trained MFN unique?" to exclude the situations where the weight matrices consist of different permutations of weight vectors. From the perspective of successive reshaping, it seems that the answer must be negative, i.e., the set of intermediate states is not necessarily unique. In the sequence of reshaping, one may distort the training object more strongly in certain reshaping directions, while compensating with less distortion in other directions. Consequently, it would result in different sets of intermediate states. From this point of view, redundant hidden nodes in an MFN can be thought of as unnecessary stretchings/squeezings, which would obviously result in different intermediate states in addition to causing undesirable over-fitting effects.

The relationship between all feasible sets of weight vectors can be explained from the perspective of optimization (i.e., the minimization of the error function). Since the feasible intermediate states appear to be continuously reshapable to each other, the corresponding weight vectors must exhibit a form of ridges in the weight space. In other words, the global minimum of the error function appears to be a set of hyper-ridges that have an infinite number of minima, rather than a valley with a unique minimum.

Linear  Separability 

With the presented interpretation, linear separability [Nilsson, Pao] can be interpreted as the feasibility of reshaping a training object from the source state to the target state with only a fixed number of stretchings and squeezings (i.e., one hyperstretching and one hypersqueezing), where the number equals the number of output nodes (or the target space dimension). Obviously, if the topological appearances of the input and output states differ from each other considerably, the fixed number of stretchings/squeezings cannot achieve the desired transformation. In such situations, successive reshaping is required, where one has the freedom to apply more stretchings/squeezings, and/or repeat hyperstretchings/hypersqueezings (the number of repetitions corresponds to the number of hidden layers). For a trained MFN, one finds that any two adjacent states automatically satisfy linear separability.

6.  Concluding  Remarks 

The essence of the geometrical interpretation is that the training samples are regarded as two states, input and output, of an imaginary training object which is multidimensional and pliable. The weight matrix between two adjacent layers, including the activation function, is associated with a reshaping operation: the matrix multiplication is thought of as a hyperstretching and the activation function is thought of as a hypersqueezing. By successive hyperstretchings and hypersqueezings, the training object is transformed from the input state to the desired output state, through the intermediate states associated with hidden layers. This interpretation is valid for multidimensional (n-to-m) mappings including classification problems, as well as for both binary and continuous mappings. A 2-bit xor problem was illustrated to show the successive reshaping process. The evolving principle in a trained MFN is observed as follows: the points that are close to (far away from) each other in the desired output state tend to get closer to (further away from) each other after each reshaping operation. This interpretation may have solved many mysteries about MFN. In particular, the generalization of MFN seems to be the result of certain symmetry within the underlying mapping function. In the second part of this paper, the interpretation will be extended to introduce novel schemes for quantifying mapping nonlinearity based on the spatial characteristics of mapping samples.

References

[Hornik] K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, Vol. 2, pp. 359-366, 1989.

[Kosko] B. Kosko, Neural Networks and Fuzzy Systems, Prentice-Hall, Inc., NJ, 1992.

[Lippmann] R.P. Lippmann, An Introduction to Computing with Neural Nets, IEEE ASSP Mag., pp. 4-22, April 1987.

[Nilsson] N.J. Nilsson, The Mathematical Foundations of Learning Machines, Morgan Kaufmann Publishers, Inc., 1990.

[Pao] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley Inc., 1989.




On  A  Unified  Geometrical  Interpretation  of  Multilayer  Feedforward  Networks 
Part  II:  Quantifying  Mapping  Nonlinearity 
Hauhua  Lee 

GE  Corporate  Research  and  Development  Center,  Schenectady,  N.Y.  12301 

Prabhat  Hajela 

Rensselaer Polytechnic Institute, Troy, N.Y. 12180

Abstract: 

This is the second of a two-part paper. The geometrical interpretation of MFN presented in Part I allows the mapping nonlinearity to be quantified based on the spatial characteristics of training samples. The normalized object distribution vector (ODV) is introduced as a generic representation of a multidimensional object. This representation is independent of the dimension, as well as the size, location and orientation, of the object. Based on ODV, two types of measurement are suggested to gauge the mapping nonlinearity between mapping samples: the distribution angle, $\alpha$, and the maximum distribution gradient, $\mu_{max}$. To facilitate the training process (or avoid hard-learning situations), one should try to reduce $\alpha$ and $\mu_{max}$ during the preparation of training samples. The schemes are supported by results of numerical experiments, including an elaborated one-to-one and continuous mapping example.

1. Introduction

This is the second of a two-part paper. In Part I, a unified geometrical interpretation of the behavior of multilayer feedforward networks (MFN) was presented. There, MFN was shown to be a successive matrix transformation mechanism, where a matrix can be thought of as representing a state of the imaginary training object. The successive matrix transformation was shown to be analogous to a sequence of hyperstretchings and hypersqueezings of the training object. This interpretation holds for both binary and continuous function mappings, and for mappings where both input and output spaces are multidimensional. More importantly, the interpretation has opened up a new perspective on the problem of quantifying mapping nonlinearity, a perspective based on the spatial characteristics of training samples. Conventionally, efforts on quantifying mapping nonlinearity (also called mapping complexity) have been made from a statistical viewpoint. For example, [Baum] suggested a relationship between the number of training samples and the number of hidden nodes. More recently, [Koiran] derived a stricter relationship between the two numbers by taking into account a specific spatial characteristic of the training samples, the smallest distance between two samples that have different outputs (for classification problems). Unfortunately, these results do not help much in practice, as one still must struggle with determining appropriate numbers of hidden nodes/layers and accommodate slow training (or hard-learning) when dealing with complex applications such as [Lee]. In this regard, the second part of this paper concentrates on exploring how one can possibly facilitate a training process, or avoid hard-learning situations.

This paper suggests that the mapping nonlinearity which the training samples exhibit should be an effective indicator for the degree of learnability (or trainability). Moreover, it is the spatial characteristics of the training samples that are essential to the mapping nonlinearity, far more so than merely the number of samples. (In this regard, the work of [Koiran] is more meaningful than [Baum].) In this paper, the mapping nonlinearity is used as an antonym of mapping similarity, i.e., the similarity between inputs and outputs of mapping samples. The normalized object distribution vector (ODV) is introduced as a generic representation of a multidimensional object. Two types of similarity measurement are suggested based on ODV: the distribution angle, $\alpha$, and the maximum distribution gradient, $\mu_{max}$. These schemes are supported by results of numerical experiments, including a continuous one-to-one mapping example.

2. Quantifying Mapping Nonlinearity

Part I has shown that during the sequence of reshaping, the training object is so twisted that its topological appearance becomes more and more similar to that of the desired output state. If such similarity between any two states can be somehow quantified, the same schemes can be used to gauge the similarity (or nonlinearity) in any mapping set. However, it is not straightforward to quantify the similarity between any two states of a training object, for the dimensionality could be different in the associated states. In addition, the overall size of the object is scaled up and down by stretchings and squeezings. In this paper, it is assumed that the similarity measure is independent of the size as well as the dimensionality of the training object. Hence, to conduct the similarity measure, one has to first represent the state of the multidimensional training object in a way that is independent of the object size and dimensionality. The normalized object distribution vector (ODV) is introduced for this purpose.

Object  Distribution  Vector  (ODV) 

The ODV is defined as a vector that characterizes the distribution of all point vectors on a training object. The components of an ODV are obtained by sequentially enumerating all distances between every pair of training points. Given a training object $V_{obj}$ that is represented by n point vectors:

$$V_{obj} = \{\, p_i \mid i = 1, \ldots, n;\ p_i \in R^m \,\}$$

then

$$ODV(V_{obj}) \equiv [\, d_{ij} \mid d_{ij} = \text{distance between } p_i \text{ and } p_j;\ p_i, p_j \in V_{obj};\ i > j;\ i, j = 1, \ldots, n \,].$$

Figure 1. Euclidean distance vs. angle.

Figure 2. A hard learning case: (a) source state; (b) after stretching; (c) after squeezing.

The number of components in an ODV equals $n(n-1)/2$, where n is the number of point vectors in the associated object. It is important to note that the ODV representation is independent of the dimensionality of the object: an ODV essentially transforms an m-dimensional object that has n point vectors into a point vector in an $n(n-1)/2$ dimensional space, regardless of the dimensionality m. Moreover, the ODV is not affected if the entire object $V_{obj}$ is translated or rotated in the associated space. If all components in $V_{obj}$ are scaled linearly, the associated ODV will also be scaled in the same manner. Hence, a normalized ODV is a representation that is invariant to the location, orientation, and size, in addition to the dimensionality, of its associated object. Subsequently, the similarity measure between two states of a training object becomes the similarity measure between two point vectors (the associated ODVs) that have the same dimension.
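
The invariance properties described above can be checked with a brief sketch. The helper names (`odv`, `normalize`, `rotate_scale_shift`) and the sample object are assumptions made for this illustration; the pair ordering (i > j) follows the definition of the ODV given earlier.

```python
import math

def odv(points):
    """Object distribution vector: pairwise distances d_ij for i > j, in a fixed order."""
    return [math.dist(points[i], points[j])
            for i in range(len(points)) for j in range(i)]

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

def rotate_scale_shift(p, angle, scale, shift):
    """Rotate a 2D point by `angle`, scale it uniformly, then translate it."""
    x, y = p
    c, s = math.cos(angle), math.sin(angle)
    return (scale * (c * x - s * y) + shift[0], scale * (s * x + c * y) + shift[1])

# Hypothetical object: the four corners of the unit square.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
# The same object rotated, enlarged three times and moved elsewhere.
moved = [rotate_scale_shift(p, 0.7, 3.0, (5.0, -2.0)) for p in square]
# The same object embedded in a 3D space.
lifted = [(x, y, 0.0) for x, y in square]

a = normalize(odv(square))
b = normalize(odv(moved))
c = normalize(odv(lifted))
print(len(a), [round(v, 4) for v in a])
```

The three normalized ODVs agree to within rounding error, and each has n(n-1)/2 = 6 components for the n = 4 points, whatever the dimension of the space that holds the object.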

In the numerical experiments conducted, it was found that the ODV can be modified in a specific way to improve the performance of the similarity measure. A simple approach is to subtract the minimum component in an ODV from every component (i.e., all components are down-shifted so that the minimum is 0). Such a modified ODV, denoted as MODV, is more appropriate for the similarity measure with binary mappings.

Distribution Angle: $\alpha$

The Euclidean distance is a common measure of the similarity between two point vectors [Kohonen]. Suppose $V_1$ and $V_2$ are two ODVs; $n_1$, $n_2$ represent the normalized vectors of $V_1$ and $V_2$ respectively. One can find that the Euclidean distance between $n_1$ and $n_2$ (denoted by $d_{1,2}$) is a function of the angle between $V_1$ and $V_2$ (denoted by $\theta$). This relationship is explained in Figure 1, where

$$\cos\theta = V_1 \cdot V_2 / (|V_1|\,|V_2|)$$

$$n_1 = V_1 / |V_1|$$

$$n_2 = V_2 / |V_2|$$

$$|n_1 - n_2|^2 = |n_1|^2 + |n_2|^2 - 2\, n_1 \cdot n_2$$

$$= 1 + 1 - 2\,(V_1/|V_1|) \cdot (V_2/|V_2|)$$

$$= 2(1 - \cos\theta),$$

i.e., $d_{1,2}^2 = 2(1 - \cos\theta)$.


Therefore, the similarity measure between two states of a training object is conceptually equivalent when using either the Euclidean distance between the associated normalized ODVs, or the angle between the associated ODVs. The angle measure is more intuitive and is adopted in this research, referred to as the distribution angle, $\alpha$. From the standpoint of reshaping, the smaller the $\alpha$ between two states, the more similar the two, and the more likely that the transformation between the two requires fewer stretchings and squeezings. (The distribution angle that is measured based on MODV is referred to as the modified distribution angle, $\alpha_m$.)
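
As an illustration of the two angle measures, the sketch below computes the distribution angle and the modified distribution angle between the input and output states of the 2-bit xor problem. The ODVs are written out by hand from the four samples (with pairs ordered i > j); the function names are assumptions of this example, and the resulting numbers are computed here rather than quoted from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    n = norm(u)
    return [x / n for x in u]

def distribution_angle(v1, v2):
    """Angle (in radians) between two ODVs of the same dimension."""
    cos_theta = dot(v1, v2) / (norm(v1) * norm(v2))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding

def modv(v):
    """Modified ODV: shift all components down so that the minimum is 0."""
    m = min(v)
    return [x - m for x in v]

r2 = math.sqrt(2.0)
# ODV of the xor inputs (0,0), (0,1), (1,0), (1,1).
v_in = [1.0, 1.0, r2, r2, 1.0, 1.0]
# ODV of the xor outputs 0, 1, 1, 0.
v_out = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

alpha = distribution_angle(v_in, v_out)
alpha_m = distribution_angle(modv(v_in), modv(v_out))

# Squared Euclidean distance between the normalized ODVs equals 2(1 - cos(theta)).
n1, n2 = normalize(v_in), normalize(v_out)
d_squared = sum((a - b) ** 2 for a, b in zip(n1, n2))
assert math.isclose(d_squared, 2.0 * (1.0 - math.cos(alpha)))

print(round(math.degrees(alpha), 1), round(math.degrees(alpha_m), 1))
```

For this binary mapping the modified measure separates the two states far more strongly than the unmodified one, which is in line with the remark that the MODV is more appropriate for binary mappings.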


Distribution Gradient: $\mu$

Apart from the distribution angle, the maximum distribution gradient is introduced as another type of mapping nonlinearity measure based on ODV. One can identify a specific hard learning situation as explained in Figure 2. Suppose $p_1$, $p_2$ represent two points in the source state, and $W$ is a weight vector. During the reshaping operation by $W$, the two points are stretched to $p_1 \cdot W$ and $p_2 \cdot W$ first, and then squeezed to $f_s(p_1 \cdot W)$ and $f_s(p_2 \cdot W)$, where $f_s$ is the activation function:

$$f_s(x) = 1/(1 + e^{-x}). \qquad (1)$$

Let $d_s$, $d_x$ and $d_t$ denote the distance between the two points at the source state, the state after stretching by $W$, and the state after squeezing, respectively, i.e.,

$$d_s = |p_1 - p_2|, \qquad (2a)$$

$$d_x = |p_1 \cdot W - p_2 \cdot W|, \qquad (2b)$$

and

$$d_t = |f_s(p_1 \cdot W) - f_s(p_2 \cdot W)|. \qquad (2c)$$

Equation (2b) can be rewritten as

$$d_x = |p_1 - p_2|\,|W| \cos\theta, \qquad (3)$$

where $\theta$ is the angle between $W$ and $p_1 - p_2$ (taken so that $\cos\theta > 0$), i.e.,

$$|W| = \frac{d_x}{|p_1 - p_2| \cos\theta}. \qquad (4)$$

Figure 3. Relationship between $d_x$ and $d_t$ for the activation function $f_s(x) = 1/(1 + e^{-x})$.


The relationship between $d_x$ and $d_t$ is explained in Figure 3. For an object with length $d_x$, the maximum length after squeezing is $d_t^{max} = 2(f_s(d_x/2) - 0.5)$, which is obtained when the center of the object coincides with the origin. If the center of the object is placed anywhere other than the origin, the length after squeezing will always be smaller. Therefore,

$$d_t \le d_t^{max} = 2(f_s(d_x/2) - 0.5). \qquad (5)$$
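
Inequality (5) can be checked numerically by sliding a segment of fixed length along the input axis of the activation function. This is a sketch only; the segment length and the grid of center positions are arbitrary values chosen for the example.

```python
import math

def f_s(x):
    """Activation function of equation (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def squeezed_length(center, d_x):
    """Length after squeezing of a 1D segment of length d_x centered at `center`."""
    return f_s(center + d_x / 2.0) - f_s(center - d_x / 2.0)

d_x = 3.0                                     # hypothetical stretched length
d_t_max = 2.0 * (f_s(d_x / 2.0) - 0.5)        # right-hand side of equation (5)

# Slide the segment center from -5 to 5 in steps of 0.1.
centers = [c / 10.0 for c in range(-50, 51)]
lengths = [squeezed_length(c, d_x) for c in centers]
best_center = centers[lengths.index(max(lengths))]

print(round(d_t_max, 4), best_center)
```

The largest squeezed length on the grid occurs when the segment is centered at the origin and coincides with the bound of equation (5); every off-center placement gives a shorter squeezed length.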


The equality holds when the origin coincides with the center point between $p_1$ and $p_2$. Substituting (1) into (5) gives $d_t \le (1 - e^{-d_x/2})/(1 + e^{-d_x/2})$, and solving for $d_x$,

$$d_x \ge 2 \ln\frac{1 + d_t}{1 - d_t}. \qquad (6)$$

Substituting (2a) and (6) into (4),

$$|W| \ge \frac{2 \ln\left((1 + d_t)/(1 - d_t)\right)}{d_s \cos\theta}. \qquad (7)$$

Equation (7) shows the required magnitude of the weight vector that can reshape, by stretching and squeezing, two points from the source state to the target state. This magnitude depends on the angle $\theta$ and the distances between the two points at the source and target states, $d_s$ and $d_t$. From the perspective of $\theta$, one sees that the most efficient weight vector (i.e., with the smallest magnitude) for reshaping the two points is in the direction parallel to the difference of the two point vectors at the source state (i.e., when $\theta = 0$). In such cases, equation (7) becomes

$$|W| \ge \frac{2 \ln\left((1 + d_t)/(1 - d_t)\right)}{d_s}. \qquad (8)$$


This equation shows that under two situations the magnitude of the weight vector, $|W|$, will increase dramatically: (a) when
$d_t$ approaches 1 for a fixed $d_s$, and (b) when $d_s$ approaches 0 for a fixed $d_t$ (note that $d_t$ is within [0, 1] and $d_s > 0$).
$|W|$ is less sensitive to situation (a) because the value of the numerator in equation (8) increases slowly with respect to
$d_t$. For example, with tolerance 0.1%, a desired $d_t$ of value 1 can accept a predicted $d_t$ of 0.999. The numerator is then
$2\ln((1 + 0.999)/(1 - 0.999)) = 15.2$, which is not too large. The major concern is with situation (b), which shows that if there
is a pair of sample points whose $d_s$ is small while the corresponding $d_t$ is not small enough, it will require a very large
weight vector to reshape (i.e., separate) the two points. If such a weight vector is not offered by any weight matrix in
the MFN, one of the two samples will never be learned by the network, i.e., one or the other will be a stubborn
sample. On the other hand, if one purposely includes such an outstanding weight vector in a weight matrix in order to
learn the stubborn samples, it is likely that the overall performance of the hyperstretching will deteriorate, as the
outstanding vector may tend to dominate the overall direction of the associated hyperstretching. Hence, it is clear that a
mapping with this type of stubborn sample is hard to learn. Note that in equation (8) $d_t$ could equal 1. To avoid the
problem of division by 0 and to emphasize only situation (b), this research suggests using an alternative, the ratio between
$d_t$ and $d_s$, as a rough and quick estimate of equation (8),

$|W| \ge d_t / d_s$.  (9)
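The bounds in equations (5) to (9) can be checked numerically. The following is a minimal sketch, assuming the logistic squeezing function of equation (1); the function names are illustrative and do not come from the paper.

```python
import math

def squash(x):
    """Logistic squeezing function f_s of equation (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def max_squeezed_length(d_x):
    """Equation (5): largest d_t reachable from a stretched length d_x
    (attained when the object is centred at the origin)."""
    return 2.0 * (squash(d_x / 2.0) - 0.5)

def min_stretched_length(d_t):
    """Equation (6): smallest d_x that can yield d_t; requires 0 <= d_t < 1."""
    return 2.0 * math.log((1.0 + d_t) / (1.0 - d_t))

def weight_bound(d_s, d_t, theta=0.0):
    """Equation (7); with theta = 0 it reduces to equation (8)."""
    return min_stretched_length(d_t) / (d_s * math.cos(theta))

def rough_bound(d_s, d_t):
    """Equation (9): the quick ratio estimate d_t / d_s."""
    return d_t / d_s

# Situation (a): the numerator grows slowly as d_t approaches 1.
print(round(min_stretched_length(0.999), 1))
# Situation (b): the bound grows quickly as d_s approaches 0.
print(weight_bound(0.5, 0.9), weight_bound(0.01, 0.9))
```

Equations (5) and (6) are inverses of each other, so `max_squeezed_length(min_stretched_length(d_t))` returns `d_t` up to rounding.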

Equation (9) expresses the hard learning situation as one in which the ratio between $d_t$ and $d_s$ is large. In other words, it is
difficult to reshape two training points whose distance at source state is small while at desired output state is large. This
is consistent with one's intuition: if similar inputs yield similar outputs, the mapping is easy to learn; if similar inputs give rise to
quite different outputs, learning is difficult. Based on equation (9), the Distribution Gradient Vector, DGV,
is introduced below for the purpose of checking the existence of stubborn samples. The "gradient" refers to the ratio of
the similarity (in terms of distance) between the source state and the target state. Let $V_s$ and $V_t$ represent the ODV of a
training object at source state and target state, i.e.,

$V_s = \{ s_i \mid s_i \in R,\ s_i > 0,\ i = 1, \ldots, n \}$,

$V_t = \{ t_i \mid t_i \in R,\ t_i \ge 0,\ i = 1, \ldots, n \}$.

Then the associated DGV is defined as the vector which consists of the ratio between every pair of source state distance
$s_i$ and target state distance $t_i$:

$DGV(V_s, V_t) = \{ g_i \mid g_i = t_i / s_i,\ t_i \in V_t,\ s_i \in V_s,\ i = 1, \ldots, n \}$.

In this research, $g_i$ is converted to an angle by the function $\tan^{-1}$ (arc-tangent) and is denoted as $\beta$. The maximum
component of DGV is denoted as $\beta_{max}$ and is suggested as the second type of measure of nonlinearity. Empirically,
it was found that during the preparation of training samples, one should avoid a large $\beta_{max}$ (e.g., no more than
a threshold, 88 degrees).
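The DGV check can be sketched in a few lines. This is an illustrative sketch only: it assumes the two ODVs are given as equal-length lists of paired distances, and the function names and the default threshold argument are hypothetical.

```python
import math

def dgv_angles(v_s, v_t):
    """Gradient angles atan(t_i / s_i), in degrees, for paired ODV components."""
    if len(v_s) != len(v_t):
        raise ValueError("source and target ODVs must have the same length")
    angles = []
    for s, t in zip(v_s, v_t):
        if s <= 0 or t < 0:
            raise ValueError("expected s_i > 0 and t_i >= 0")
        angles.append(math.degrees(math.atan(t / s)))
    return angles

def has_stubborn_pair(v_s, v_t, threshold_deg=88.0):
    """True when the largest gradient angle exceeds the threshold."""
    return max(dgv_angles(v_s, v_t)) > threshold_deg

# The second pair is close at the source state (0.01) but far apart at the target state (0.9).
print(has_stubborn_pair([0.5, 0.01], [0.4, 0.9]))
```

An angle of 88 degrees corresponds to a distance ratio of roughly 28.6, so only pairs whose target distance is much larger than their source distance are flagged.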
Figure 4. Successive Reshaping Example for a 1x2x2x1 MFN.
a, input state; b, desired output state; c, predicted output state; d, input/output mapping;
e, the first intermediate state; f, the second intermediate state.

Figure 5. Distance Matrix and ODV for the input state of the 1x2x2x1 example.

Table 2. Distribution angles for the 1x2x2x1 example. ($Inter_1$ and $Inter_2$ are for the 1st and
2nd intermediate state; $Out_p$ and $Out_d$ are for the predicted and desired output state.)

Table 1. Raw data of 15 training samples.


3. Numerical Experiments and Discussions

Extensive numerical experiments have been conducted to validate the proposed schemes for quantifying mapping
nonlinearity. Selected results of the experiments are presented here; an MFN model that approximates a continuous
one-to-one mapping is used to illustrate the evolution of a training object. In addition, the distribution angle $\alpha$ is verified
with several models.

Evolution of Training Object

Figure 4 illustrates the successive reshaping of a one-to-one and continuous mapping. There are 15 training
samples, trained by an MFN with a configuration 1x2x2x1. The use of two hidden layers, each with two
hidden nodes, was intentional: in this manner, the intermediate states of the training object are 2D and therefore
can be visualized easily. The 15 samples, labeled from a to o, are listed in Table 1 with the input state, the
desired output state, the predicted output state, as well as the two intermediate states at hidden layers (after
stretching and squeezing).

Figure 6. Evolution of ODVs.
a, input state; e, the first intermediate state; f, the second intermediate state; c, predicted output state.

The data in Table 1 is shown schematically in Figure 4, where there are 6 curves (including 3 straight lines) marked as a,
b, c, d, e and f. Curves a, b and c represent the input, the desired output and the predicted output states, respectively;
they are shown as straight lines since they are associated with 1D space. The minor deviations between b and c are
negligible, i.e., the desired output and predicted output are considered equivalent to each other. Curves d, e and f are
drawn inside a unit square. Curve d shows the mapping relationship between the inputs and outputs (the solid line marks
the desired outputs and squares mark the predicted outputs). By taking the horizontal axis as the input state and the
vertical axis as the output state, one can visualize the mapping relationship. In this case, it shows a wave-like form. (In
fact, the underlying mapping function is of the form $y = x - \sin x$. The samples are randomly chosen from within $[0, 4\pi]$
and both inputs and outputs are normalized to [0, 1].) The mapping relationship is not always visualizable: it is easy to
visualize only when the output space is 1D and the input space is below 3D. At this point, one is advised not to be
distracted by the mapping curve, since the mapping relation is unrelated to the explanation of successive
reshapings. Instead, curves e and f are what need to be emphasized: e and f correspond to the first and second intermediate
states of the training object, respectively.

In Figure 4, the imaginary training object is shown with a single dashed line connecting all training points sequentially,
from point a to point o. Note that on curves e and f only the first three points, a, b and c, are labelled, and one can easily
trace the remaining points. Now the successive reshaping of the training object can be represented by four curves
sequentially, i.e., a → e → f → c. One may observe how the training object is folded and twisted from the input state to the
output state: the points that are close to (or far away from) each other at output state tend to get closer to (or further
away from) each other during the reshaping process. This is exactly the evolution principle observed in the previous
section.

The reshaping process can also be visualized from the perspective of ODVs. Figure 5 shows how the ODV associated
with the input state is obtained. By listing all the distances between every pair of points, one can come up with a distance
matrix. (Figure 5 only shows a portion of the entire distance matrix for the input state.) This matrix is
symmetric and its diagonal elements are all 0. Hence, essentially the ODV is defined as the upper (or lower) triangular
portion of the distance matrix (no need to include the diagonal). One can display this triangular matrix in a 3D view,
e.g., by attaching two axes along the directions of row and column of the matrix, and the third axis to show the
magnitude of elements. In this manner, the ODVs associated with curves a, e, f, and c in Figure 4 are shown in Figure 6, where
each ODV is shown as a 3D view as well as a 2D contour plot. The sample sequences are the same in both column and
row directions, i.e., from point a to point o, as shown in Figure 5. Such a sequence results in a single smooth hump in the
3D view of the input state ODV. The appearance of an ODV is directly affected by the sample sequence in both
the row and column directions. It is important to note that, in Figure 6, the evolution principle can be recognized more
easily in view of the heights of elements. The elements that are high (or low) in output state tend to become higher (or
lower) during the reshaping process.
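The construction of an ODV from a set of points can be sketched as follows. This is a minimal sketch under stated assumptions: distances are Euclidean, points are tuples of coordinates, and the normalization mentioned in the conclusion is omitted because its details are not given in this section. The function names are illustrative.

```python
import math
from itertools import combinations

def distance_matrix(points):
    """Full symmetric matrix of pairwise Euclidean distances (zero diagonal)."""
    return [[math.dist(p, q) for q in points] for p in points]

def odv(points):
    """Upper-triangular part of the distance matrix, diagonal excluded,
    read row by row in the given sample order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

# Three 1D samples; the sample order fixes the order of the ODV elements.
samples = [(0.0,), (0.3,), (1.0,)]
print(odv(samples))
```

For the 15 training samples of the example, the ODV has 15 * 14 / 2 = 105 elements.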

The distribution angle $\alpha$ is defined for measuring the similarity between two ODVs. Table 2 lists the $\alpha$ values between
each pair of ODVs that are associated with the five states of the 1x2x2x1 MFN. The notations $Inter_1$ and $Inter_2$ refer to
the first and second intermediate states; $Out_p$ and $Out_d$ refer to the predicted and desired output states. The $\alpha$ values are listed in
degrees. The smaller the $\alpha$, the more similar the two ODVs. The data in the first row exhibit an increasing trend (from
left to right), which means that the input state is most similar to its adjacent state, the first intermediate state, and
becomes more and more dissimilar to the states after more reshapings. In fact, the data in the same rows and columns all
follow the same trend (decreasing from top to bottom in a column). Therefore, it appears that $\alpha$ is a rational measure for
the similarity between two states of a training object.

In Table 3, the same set of training samples is trained by two different MFNs. With the 1x10x1 MFN, $\alpha$ follows the
uniform increasing/decreasing trend well. However, with the 1x8x8x1 MFN, the trend is disturbed by data such as 19.6
and 39.7. This disturbance must be due to the fact that the 1x8x8x1 MFN contains too many redundant hidden
nodes that lead to the undesired over-shaping effect (recall that 1x2x2x1 is sufficient with two hidden layers).
Table 4 shows $\alpha$ for a binary mapping example, the 4-bit xor problem. The training samples used are the full set of
mapping points (i.e., $2^4 = 16$ samples). The net configuration is 4x3x1. Both the regular distribution angle $\alpha$ and the
modified distribution angle $\alpha_m$ are tabulated and the convergence trend is better with $\alpha_m$. In fact, in several other
experiments with boolean mappings, $\alpha_m$ always performs better than $\alpha$. However, for continuous mappings, the
difference in performance between $\alpha$ and $\alpha_m$ is not significant.

In another experiment, the 4-bit xor problem was trained by a 4x2x1 MFN, where an insufficient number of hidden
nodes was intentionally used (i.e., 2 hidden nodes) in order to obtain an under-fitting intermediate state. It was found
that during the training process, the maximum distribution gradient $\beta_{max}$ of output state vs. intermediate state increased
sharply when the training process was reaching a saturation point; in terms of gradient angle, $\beta_{max}$ was approaching
90 degrees.

Table 3. Distribution Angle $\alpha$ for MFNs 1x10x1 and 1x8x8x1.

Table 4. Distribution Angle $\alpha$ and $\alpha_m$ for 4-bit xor mapping (MFN: 4x3x1).


4.  Conclusion 

Based on the geometrical interpretation presented in Part I, this paper introduced the normalized object distribution
vector (ODV) as a generic representation of the multidimensional training objects. This representation is independent
of the dimension, size, location and orientation of the associated object. Based on ODV, two types of measurement are
suggested to gauge the mapping nonlinearity between any pair of source/target states. The first type is the distribution
angle $\alpha$ and the second is the maximum distribution gradient $\beta_{max}$. With $\alpha$ and $\beta_{max}$, one can then try to reduce the
mapping nonlinearity within the mapping samples so as to facilitate the training process or avoid hard-learning
situations.
Abstract

We describe an extension to the Associative Reward-Penalty, or $A_{R-P}$, algorithm for solving
nonlinear supervised learning tasks utilising multi-layer feed-forward networks. We
introduce a variant of the $A_{R-P}$ algorithm, called the 'Unbounded' reinforcement $A_{R-P}$. The
method utilises a quantised real-valued reinforcement, which is a payoff metric optimised by
an associated Critic Net.


1  Introduction 

The underlying principle of the Associative Reward-Penalty, $A_{R-P}$, algorithm is that a binary
(scalar) reward signal is broadcast globally across a network. The reinforcement signal "r" is
then utilised by each unit in the net to determine its weight updates. The premise is that the
stochastic nodes in the net are given credit or reinforcement if the net gives a 'successful' output.
The net is given a debit or penalised if its output is wrong [1]. The initial research of Barto [2]
defines the reward signal $r \in \{0, 1\}$ in his P-model as:

$r = 1$ with probability $1 - \epsilon_o$, or $r = 0$ with probability $\epsilon_o$  (1)

where the error $\epsilon_o$ is the mean-square output error of the net.

This means the reinforcement is deduced solely as a function of the output error and the present
input stimuli, which we term primary information. The scalar reward has a very low information
content and as such cannot give credit to a good action as precisely as Barto's [2] S-model, where
the reward signal $0.0 \le r^* \le 1.0$ is a real-valued variable, defined by:

$r^* = 1 - \epsilon_o$  (2)
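The difference between the two reward models can be illustrated with a short sketch. It assumes only equations (1) and (2), with the output error already scaled to [0, 1]; the function names and the optional `rng` argument are illustrative.

```python
import random

def p_model_reward(error, rng=random):
    """Equation (1): binary reward, r = 1 with probability 1 - error."""
    # rng.random() is uniform on [0, 1), so this test succeeds with probability 1 - error.
    return 1 if rng.random() >= error else 0

def s_model_reward(error):
    """Equation (2): real-valued reward r* = 1 - error."""
    return 1.0 - error

rng = random.Random(0)
draws = [p_model_reward(0.3, rng) for _ in range(10000)]
# The mean of the binary rewards approaches the S-model reward for the same error.
print(sum(draws) / len(draws), s_model_reward(0.3))
```

A single P-model draw carries one bit, while the S-model value conveys the error directly; this is the bandwidth difference discussed in the text.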

The S-model requires a large bandwidth signal to be broadcast to all the units in a net. This
has significant repercussions if one is considering mapping the S-model to hardware, as the reward
signal would not be a single binary control line. Here the network is considered as being supervised
by an external training environment ($R$) that provides input stimuli to the network and monitors
the output action of the net. It is of interest to note that the $A_{R-P}$ algorithms do not utilise
secondary information, such as past data obtained from the environment $R$. In this paper, we
describe an extension to the $A_{R-P}$ algorithm which uses secondary information, based
on tracing the frequency of 'stimuli' occurrence and then using this to derive a prediction of the
reinforcement.
2 The Sigma-pi Neuron Model

The neuron model we utilise has previously been termed a Sigma-pi unit [3]. These units are
similar to pRAM units [4], and as they are RAM based they may be placed in the same category
as PLN units [5] or the more recent GNU units [6]. We use the stochastic model Direct Output
Node (DON). The activation of the DON, for the Analogue case, is defined as:

“  =  ^  (3) 

^  e  »=i 

where  Zi  defines  a  set  of  probability  distributions  for  an  input  address  formed  from  a  set  of  Boolean 
variables  {A',},  given  by 

P^(xi) H^Zi)  (4) 

Given $x \in \{x_1, x_2, \ldots, x_i\}$, a binary input vector which may be represented as a set of bits in
positions $x_1$ to $x_i$. The site address $\mu \in \{\mu_1, \mu_2, \ldots, \mu_i\}$ is represented by a set of bits in positions
$\mu_1$ to $\mu_i$. The site value is addressed by the binary string $\mu$. The site stores a value
$S_\mu \in \{-S_m, \ldots, S_m\}$. Then for the stochastic model

$a = \sum_\mu \sigma(S_\mu) P(\mu) = \langle \sigma(S_\mu) \rangle$  (5)

The output $y$ of the DON is defined as equal to the activation $a$ and

P(jj=l|,,)  =  <r(5^)= - *  (6) 

1  +  e  e 

Then the output $y = \langle \sigma(S_\mu) \rangle$. The output behaviour of these units is similar to that of
Boltzmann units [7].

3 Training Artificial Neural Networks by Error Minimisation

The  goal  of  the  learning  regime  is  to  minimise  a  mean-square  output  error  term; 

=  (7) 

where $[\cdot]^2$ is the square error per input stimulus, defined on the output. This is summed over the
set of all $N_v$ output units, or visible units. The error is
the difference between the target response $y_j^t$ of output $j$ for a given input/output pattern pair,
and the sigmoidal value of the site $\sigma(S_\mu)$, where $\mu$ specifies a site address.

3.1 Unbounded Reinforcement $A_{R-P}$ learning

The external reward has been previously defined in (1), where $r_{(t)} \in \{0, 1\}$ is a binary scalar value.
The external reinforcement, in the case of unbounded Reinforcement, is then scaled:

$r_{(t)} = (2 \cdot r_{(t)}) - 1$  (8)

where the scaled $r_{(t)} \in [-1, +1]$ is of the form utilised by Barto, Sutton and Anderson [8]. The scaled reward
signal is then used to derive an improved or internal reinforcement signal, given by:

^(0  (9) 

where $P_{v(t)}$ is the present prediction and $P_{v(t-1)}$ the past prediction. It should be noted that this
is not the same as Barto's [8] original work:
we use the present and previous prediction values for the given site address $v$. The coefficient
$0.0 \le \gamma \le 1.0$ has previously been termed the "discount factor" by Barto et al. [8]. The
prediction value is updated by:


A /*„(..+,)  = /if  (10) 

where  0  <  /i  <  1  is  a  positive  constant  deteimining  the  rate  of  change  of  All  the  input 

eligibility  traces  are  updated  using: 

+  ( 1  -  ( n ) 

for all input addresses $0 \le v \le \eta$, where $\eta$ is the maximum input address (i.e. for an 8-tuple
$\eta = (2^8) - 1$, or 255 decimal or FF hexadecimal), and where lambda, $0 < \lambda < 1$, determines the
decay rate of the eligibility traces. The binary value $x_v$ is a trigger for the eligibility trace: when
the site $v$ is addressed $x_v = 1$, and all other non-addressed traces are updated with $x_v = 0$. The
internal reinforcement is then re-scaled

’•(.)  =  ^(^+1-0)  (12) 

which denotes a quantised real-valued reinforcement, $-1.0 \le r^*_{(t)} \le +2.0$, that is defined as the
unbounded internal reinforcement, which permits penalisation even when $\lambda = 0$.

The net is then trained substituting $r^*_{(t)}$, which is the unbounded internal reinforcement,
while the standard $A_{R-P}$ regime utilises $r$ defined in (1). Then each node $j$, given site address
$\mu$, updates its site value according to the following equation:


$\Delta S_\mu^j = \alpha [y^j - \sigma(S_\mu^j)] \, r^*_{(t)} + \alpha\lambda [1 - y^j - \sigma(S_\mu^j)] (1 - r^*_{(t)})$  (13)


4  Discussion  of  Theory 

In the original work of Barto [2] he utilises a scalar Reinforcement signal. In the above we replace
this with a quantised Reinforcement signal based on the present external Reinforcement and past
and present prediction values. We utilise the Adaptive Critic Element (ACE) of Barto, Sutton
and  Anderson  [8]  to  maximise  over  time  by  maximising  in  the  immediate  future.  Barto’s 
method  may  be  thought  of  as  a  ’temporal  difference’  (T.D.)  method  [9]  as  he  utilises  data  that 
relates  to  the  past  and  present  events  to  enable  a  payoff  metric  to  be  optimised,  where  the  payoff 
was  used  as  a  "prediction”  or  "expectation”  of  a  future  Reinforcement  [10] .  The  prediction  values 
are  calculated  with  reference  to  the  ACE’s  input  eligibility  traces,  where  the  eligibility  is  a  trace 
of  events  over  time  [8]. 

The eligibility trace may be described as follows: given a pathway between two neurons, the
pathway is said to reach maximum eligibility a short time after the occurrence of a nonzero input
signal on that pathway. The input eligibility traces are averages ($\bar{x}$), where the bar denotes an
exponential average over time. Each input to the Adaptive Critic Element is given its own trace
in Barto's original study.

In our case we store the eligibility values in an n-tuple. The input eligibility is interpreted
as follows: an 'i' bit input vector $\{x_1, \ldots, x_i\}$ addresses location 'v' in an eligibility
n-tuple, giving an eligibility $\bar{x}_{v(t)} \in [0, +x_n]$ that is specified as a 'g' bit number, having $D = x_n + 1$
discrete levels. Hence if $x_n = 8$ then $\bar{x}_{v(t)} = \{0.125n \mid n = 0, 1, \ldots, N\}$, where $N = 1.0/0.125 = x_n$.
These input eligibility traces increase when the input is active, and decrease to zero with time in
the absence of future activity.
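The rise and decay of the input eligibility traces can be sketched as below. This is an illustrative sketch only: it assumes the exponentially averaged trace used in Barto, Sutton and Anderson's adaptive critic, in which each trace moves towards 1 while its site is addressed and decays towards 0 otherwise. The function name, the list-based storage and the decay value are assumptions, and quantisation to discrete levels is left out.

```python
def update_traces(traces, addressed, decay):
    """One update of every trace: new = decay * old + (1 - decay) * x_v,
    where x_v is 1 for the addressed site and 0 for all other sites."""
    return [
        decay * old + (1.0 - decay) * (1.0 if v == addressed else 0.0)
        for v, old in enumerate(traces)
    ]

traces = [0.0] * 4                                   # one trace per input address
traces = update_traces(traces, addressed=2, decay=0.5)
print(traces)                                        # only the trace of site 2 has risen
traces = update_traces(traces, addressed=0, decay=0.5)
print(traces)                                        # site 2 decays while site 0 rises
```

With a decay close to 1 the traces remember past activity for longer; with a decay close to 0 they follow the most recent address almost exclusively.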

The adaptive critic is utilised to predict an internal Reinforcement. The procedure the adaptive
critic follows is: given an external reward signal at time t, the critic deduces an internal
Reinforcement signal based on the external Reinforcement $r_{(t)}$ and the present $P_{v(t)}$ and
past $P_{v(t-1)}$ predictions. The future prediction value is then derived as a function of the input
eligibility trace. Finally all the input eligibility traces are updated. One should note that the
predictions are quantised and stored in an n-tuple in the same manner as the eligibility trace,
where $P_{v(t)} \in [0, +P_n]$, giving $D = P_n + 1$ discrete levels, which are stored as a g bit number.
Hence if $P_n = 8$ then $P_{v(t)} = \{0.125n \mid n = 0, 1, \ldots, N\}$, where $N = 1.0/0.125 = P_n$. The internal
reinforcement is calculated using $P_{(t)}$ and $P_{(t-1)}$, hence the unbounded reward signal is defined
as $r^*_{(t)} \in [-r_n, +2r_n]$, giving $D = 3r_n + 1$ discrete levels, which is a g bit number. Then the
unbounded reward utilised in (13) is defined as $r^*_{(t)} = \{0.125n \mid n = -N, \ldots, 0, 1, \ldots, 2N\}$, given
$N = r_n$, and for all our simulations $P_n = x_n = r_n = 8$.
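The discrete levels described above can be enumerated directly. The following is a minimal sketch assuming a step of 1/8 (i.e. 0.125), a real-valued range of [0, 1] for eligibilities and predictions, and [-1, 2] for the unbounded reward; the helper names are illustrative.

```python
def levels(lo, hi, n_steps=8):
    """All multiples of 1/n_steps from lo to hi inclusive."""
    step = 1.0 / n_steps
    return [k * step for k in range(round(lo * n_steps), round(hi * n_steps) + 1)]

def quantise(value, lo, hi, n_steps=8):
    """Clip value to [lo, hi] and round it to the nearest multiple of 1/n_steps."""
    clipped = min(max(value, lo), hi)
    return round(clipped * n_steps) / n_steps

print(len(levels(0.0, 1.0)))    # levels for eligibilities and predictions
print(len(levels(-1.0, 2.0)))   # levels for the unbounded reward
print(quantise(0.3, 0.0, 1.0), quantise(2.7, -1.0, 2.0))
```

The two counts correspond to $D = x_n + 1$ and $D = 3r_n + 1$ for $x_n = r_n = 8$.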

5  Simulation  Results 

5.1  The  8-3-8  encoder 

We utilise the 8-3-8 encoder of Hinton et al. [11], which they used for their research into the
Boltzmann machine. It is a simple abstraction of the recurring task of communicating
information among various components of a parallel network. We use this to benchmark the learning
algorithms, because it is clear what the optimal solution is and it is non-trivial to discover it.

The encoder is made up of two groups of visible units, designated v1 and v2, representing the
two systems that wish to communicate their states. It should be noted that the v1 units are
passive, just used to communicate their inputs to the next layer of the encoder. Each group has
V units. In the simple formulation we consider here, v1 and v2 are not directly connected but
both are connected to a group of H hidden units, with H < V, so h may act as a limited capacity
channel through which information about v1 must be transmitted with optimal coding. In all
our simulations we begin by setting all the site values at the start of the training to $S_\mu = 0$, so that
$\sigma(S_\mu) = 0.5$, giving $P(Y = 1 \mid \mu) = 0.5$, i.e. a 50% probability of the output Y obtaining a value
"1"; no prior information has been bestowed on the network. Hence finding a solution to such
a problem requires that the two visible groups come to agree upon the meaning of a set of codes
without any prior conventions for communicating through h.

5.2  Experimental  Delimitations 

The results presented show a graph of the error $\epsilon_o$, where $\epsilon_o$
is the average error over $N_m = 100$ trained networks, after 6000 training cycles have elapsed, over
all $N_p$ training patterns, of the mean-squared output error (7) of each training vector.
The training vector set used were the hexadecimal numbers {F0, 78, 3C, 1E, 0F, 87, C3, E1}, hence
$N_p = 8$. The training set was randomly ordered for each sample and a different seed was given to
the stochastic operator of the net at the start of each training session. The training vectors each
have four adjacent set-bits. This means that there are 192 valid codes, which represent 0.0011%
of all possible code solutions. For all the experiments $p = 0.3$, $\lambda = 0.0$, $S_m \in [-10, +10]$ and
$P_n = x_n = r_n = 8$.
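The eight training vectors are the cyclic rotations of a single 8-bit pattern with four adjacent set-bits, which can be verified with a short sketch. The function name is illustrative. The last line rests on an assumption that is not stated in the text, namely that "all possible code solutions" counts the 8^8 ways of assigning one of eight 3-bit codes to each of the eight patterns.

```python
def rotations(pattern, width=8):
    """All cyclic right-rotations of a width-bit pattern, starting with the pattern itself."""
    mask = (1 << width) - 1
    return [((pattern >> k) | (pattern << (width - k))) & mask for k in range(width)]

training_vectors = rotations(0xF0)
print([format(v, "02X") for v in training_vectors])

# Every vector keeps exactly four set bits, adjacent when the byte is read cyclically.
print(all(bin(v).count("1") == 4 for v in training_vectors))

# Assumption: the quoted 0.0011% corresponds to 192 valid codes out of 8**8 assignments.
print(round(192 / 8**8 * 100, 4))
```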


Figure 1 Average log error $\epsilon_o$ versus log $\alpha$ (learning rate), for Sigma-pi based 8-3-8 encoder
with 8 training vectors having four set-bits. (The graph shows the average error
$\epsilon_o$, over 100 nets, after the networks have been trained for 6000 cycles. The
lighter solid line shows the error for the standard scalar $r \in \{0, 1\}$ Reinforcement
$A_{R-P}$. The heavy solid line shows the error for the unbounded internal reinforcement
$-1.0 \le r^*_{(t)} \le +2.0$.)

The plot of log error $\epsilon_o$ against log $\alpha$ is shown in Figure 1. The learning rates used were
$\alpha$ = 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0 & 20.0. The graph shows that unbounded $A_{R-P}$ reduces the
average percentage error over all eight learning rates, when compared with standard $A_{R-P}$, by
10%.


6 Concluding Remarks

The unbounded Reinforcement Associative Reward Penalty $A_{R-P}$ gives increased efficiency of
training when compared to the standard $A_{R-P}$. This, we hypothesize, is due to the fact that the
unbounded reward signal is able to reward/penalise the net to a higher degree. If, for example,
the external reward is a penalty signal and the temporal difference between the predictions is a
negative quantity (i.e. $P_{v(t)} < P_{v(t-1)}$), then the internal reinforcement is reduced, and the net is
then penalised to a greater degree. The converse is also true, as the internal Reinforcement would
be increased if the external Reinforcement signal is a reward and the temporal difference between
the predictions is positive (i.e. $P_{v(t)} > P_{v(t-1)}$). It is of interest to note that the unbounded
$A_{R-P}$ training methodology permits penalisation of the net even when the penalty coefficient is
set to zero, as the internal reward signal may be negative; normally the net is only penalised if
the penalty coefficient $\lambda$ is non-zero.
Abstract

Fuzzy encoding is the process of determining the respective degrees to which a datum belongs to a collection of
fuzzy sets and subsequently using these membership grades in place of the original datum. This procedure is
similar to 1-of-n intervalization encoding except that gradual transitions occur at interval boundaries.

This paper examines the efficacy of fuzzy encoding the input data presented to artificial neural networks
employing the back-propagation algorithm. A general problem is described and defined in two, three, four, and
twenty dimensions. Performance results obtained from two groups of trained artificial neural networks are
compared and contrasted: one group used non-encoded data and the other used the corresponding fuzzy encoded
data. The networks using the fuzzy encoded data consistently attained superior classification rates compared to
their non-encoded counterparts. Moreover, these results were achieved using significantly fewer iterations.

Finally, performance results obtained using this process on a set of "real-world" data, namely, 1-dimensional
magnetic resonance spectra of thyroid biopsies, are discussed and compared with results obtained using other
techniques. Once again, the fuzzy encoded networks outperformed the corresponding non-encoded networks.
However, when some conventional enhancements were made to the networks, the performance of the non-encoded
networks improved appreciably while the fuzzy-encoded networks suffered some performance degradations.

1.  Introduction 

The  artificial  neural  network  (ANN)  paradigm  has  consistently  demonstrated  its  effectiveness  as  a  robust 
classification  technique.  The  back-propagation  network  (BPN)  [1]  has  served  as  a  workhorse  and  a  touchstone  for 
many  fruitful  inquiries.  This  paper  investigates  the  utility  of  fuzzy  encoding  as  a  preprocessing  method  for  BPNs. 

The BPN architecture that is used in this investigation has the following characteristics. The transfer function,
tr, is the logistic function,

$tr(x) = (1 + e^{-x})^{-1}$  (1)

and the global error function, E, is

$E = 0.5 \sum_k (d_k - o_k)^2$  (2)

where the $d_k$'s and $o_k$'s are the respective components of the desired and actual outputs and the weight changes are
calculated using the standard gradient descent strategy

(3) 

where $\alpha$, the learning coefficient, is set to 0.9. No momentum term is used.

Fuzzy  set  theory  is  an  extension  of  Boolean  set  theory  developed  by  Zadeh  [2].  Fuzzy  encoding  involves  taking  a 
single  input  value  and  intervalizing  it  across  a  collection  of  fuzzy  sets,  thereby  producing  a  list  of  degrees  of 
membership for each of the fuzzy sets. In other words, if we have n fuzzy sets and $f_i$ is the membership function for
the i-th fuzzy set then the list of values for an input value x is $\{f_1(x), f_2(x), \ldots, f_n(x)\}$. Selecting intervals for the
fuzzy sets is usually an experimental or heuristic process and is similar to the techniques used in standard 1-of-n
intervalization  encodings.  The  purpose  of  intervalization  is  to  reduce  the  effects  of  noise  in  the  data  as  well  as  to 
transform  the  problem  in  such  a  way  that  a  non-linear  regression  model  such  as  BPN  can  provide  better  solutions. 
The  fuzzy  membership  functions  are  easily  defined  once  the  intervals  have  been  selected  because  the  definition 
corresponds  to  1-of-n  intervalization  with  the  addition  of  gradual  transitions  at  the  respective  interval  boundaries. 

2. The Classification Problem

Data were generated that fell into two classes: those points that were bounded by a set of hyperplanes and those
that were outside the region. Figure 1a illustrates the problem in two dimensions. A point, $(x_1, x_2)$, is considered to
be class 1 if $-0.75 < x_1 < 0.75$ and $-0.75 < x_2 < 0.75$; otherwise, it belongs to class 0. Four lines, H1 through H4,
perfectly separate the two classes. For an n-dimensional problem, a point $(x_1, x_2, \ldots, x_n)$ is considered to be class 1
if $-0.75 < x_i < 0.75$ for all $i = 1, 2, \ldots, n$, or class 0 otherwise. Further, 2n hyperplanes will perfectly separate the two
classes. In the ideal case, a BPN will find the 2n hyperplanes. It should be noted that (at least) 2n processing
elements (PEs) in the hidden layer are needed, corresponding to the 2n hyperplanes.
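
The class definition above can be sketched in a few lines of Python. The function names and the `(axis, offset)` representation of a hyperplane are illustrative assumptions, not part of the original experiments.

```python
def classify_point(point, bound=0.75):
    """Return 1 if every coordinate lies strictly inside (-bound, bound), else 0."""
    return 1 if all(-bound < x < bound for x in point) else 0

def separating_hyperplanes(n, bound=0.75):
    """List the 2n axis-aligned hyperplanes x_i = -bound and x_i = +bound
    as (axis index, offset) pairs."""
    return [(axis, sign * bound) for axis in range(n) for sign in (-1, 1)]
```

For the 2-dimensional case, `separating_hyperplanes(2)` yields four entries, matching the four lines H1 through H4.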

However, in practice a BPN may not find these hyperplanes. Figure 1b illustrates a suboptimal solution for the
2-dimensional problem using three lines. In this case, one of two events will have occurred: one of the hidden PEs
will have weights that are similar to one of the other three PEs in the hidden layer (in which case it will duplicate
the functionality of the other PE); or, the weights of one of the PEs are near zero in which case it contributes
negligibly  to  the  outcome.  It  should  be  noted  that  even  when  only  three  hyperplanes  are  used,  a  BPN  may  converge 
to  a  point  where  a  majority  of  the  vectors  will  be  correctly  classified.  However,  this  benefit  may  also  be  considered 
a  disadvantage  -  when  it  begins  to  converge  to  a  solution,  a  BPN  is  not  able  to  escape  from  the  associated  local 
minimum  to  determine  if  better  solutions  exist.  This  is  a  result  of  the  gradient  descent  strategy  —  the  error  cannot 
increase,  thus  when  the  algorithm  begins  to  converge  towards  a  solution  it  cannot  diverge  from  it. 


Figure 1: a) The Problem in Two Dimensions b) A Non-Ideal Solution

3. Conventional Enhancements to the BPN

A  number  of  enhancements  may  be  made  to  BPNs  that:  increase  the  rate  of  convergence;  increase  robustness;  or 
improve  the  accuracy  of  the  final  results  [3].  Using  the  hyperbolic  tangent  function  as  the  transfer  function  instead 
of  the  logistic  function  typically  improves  the  performance  of  a  BPN.  The  transfer  function’s  output  is  a  multiplier 
in  the  weight  update  formula.  The  logistic  function’s  range  of  [0,  1]  may  cause  a  bias  towards  learning  larger 
values.  However,  the  hyperbolic  tangent  function  is  bipolar,  hence  this  will  not  occur.  A  gain  term,  g,  may  also  be 
introduced  into  the  sigmoid, 

$$f_T(x) = (1 + e^{-gx})^{-1} \tag{4}$$

A large gain value may increase the rate of convergence but at the same time makes the BPN more susceptible to
pitted error surfaces and may cause wild oscillations during learning.

Different  learning  and  momentum  rates  may  be  used  for  each  layer  and/or  after  each  of  a  set  of  predetermined 
number  of  iterations.  A  typical  scenario  is  to  use  large  learning  and  momentum  values  for  the  initial  layers  and/or 
the  initial  sets  of  iterations  and  successively  smaller  values  for  subsequent  layers  and/or  sets  of  iterations.  The  end 
effect  of  this  modulated  learning  strategy  is  to  search  for  gross  data  features  at  the  initial  layers  and/or  during  the 
initial  sets  of  iterations  and  successively  refine  these  detected  features  by  subsequent  layers  and/or  sets  of  iterations. 

A  number  of  preprocessing  techniques  may  also  be  applied  to  the  data  before  presentation  to  a  BPN.  Data  may  be 
scaled  and  normalized  in  order  to  avoid  saturation  of  the  sigmoid  by  large  input  values  (with  respect  to  other  input 
values).  Uniform  or  gaussian  noise  may  also  be  added  to  the  ANN  in  order  to  make  the  system  more  robust. 
Principal  component  analysis  may  be  performed.  Fuzzy  encoding  falls  into  the  preprocessing  category  of 
enhancements  and  its  efficacy  will  be  examined  in  section  6. 


4.  Ideal  Solutions 

Figure 1a suggests that the ideal solution for the n-dimensional problem requires exactly 2n hyperplanes. If a
step  function  is  used  as  the  transfer  function 

$$f_T(x) = \begin{cases} 1, & x > 0 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$


then the solution is straightforward. For each dimension, i, we have a pair of hidden PEs corresponding to the pair
of hyperplanes used for that dimension. The weights for the corresponding coordinate, $x_i$, are set to 1. The weights
are set to 0 for the remaining coordinates. The weight value between the first PE and the output node is 1 and -1
for the second. The bias for the first PE is 0.75 and -0.75 for the second. Finally, the bias for the output PE is
-(n-c), where c is a small positive real number. If $x_i$ is bounded by the corresponding hyperplanes then the sum of the pair of PEs is
large, otherwise, it tends towards 0. If all coordinates, $x_1, x_2, \ldots, x_n$, are bounded by their respective hyperplanes
then the sum of the outputs of the 2n hyperplanes is large.
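
The construction just described can be checked with a short Python sketch. It assumes a step function that outputs 1 for strictly positive net input and 0 otherwise, and it picks c = 0.5 as the "small" constant; both choices, like the function names, are assumptions for illustration.

```python
def step(z):
    """Step transfer function; returning 0 at z == 0 is an assumption."""
    return 1 if z > 0 else 0

def ideal_network(point, c=0.5):
    """Hand-wired network with one pair of hidden PEs per input dimension."""
    n = len(point)
    total = 0
    for x in point:
        first = step(1.0 * x + 0.75)    # input weight 1, bias 0.75: fires when x > -0.75
        second = step(1.0 * x - 0.75)   # input weight 1, bias -0.75: fires when x > 0.75
        total += 1 * first + (-1) * second  # output weights 1 and -1
    # Output PE with bias -(n - c): fires only if all n pairs contribute 1.
    return step(total - (n - c))
```

Each pair contributes 1 exactly when its coordinate lies between the two hyperplanes, so the output PE fires only when all n coordinates are in range.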

Of  course,  a  BPN  cannot  use  the  step  function  as  a  transfer  function  because  the  gradient  descent  strategy 
requires  a  differentiable  transfer  function.  Moreover,  because  the  logistic  function  produces  continuous  values 
between 0 and 1, it smooths the output values instead of providing a discrete, non-continuous jump from 0 to 1.
The  smoothing  nature  of  the  sigmoid  tends  to  affect  the  results  such  that  data  points  near  the  boundaries  become 
misclassified.  One  way  to  compensate  for  this  is  to  use  a  gain  term  with  the  logistic  function.  As  the  gain  term 
approaches  infinity  the  logistic  function  tends  towards  a  step  function.  Unfortunately,  large  gain  terms  usually 
cause  the  BPNs  to  wildly  oscillate.  However,  if  we  use  the  logistic  function  without  any  gain,  we  can  still  get  an 
ideal  solution  if  we  change  the  bias  values  and  input  weights  for  the  hidden  PEs  (figure  2).  In  fact,  the  larger 
values  (two  orders  of  magnitude)  tend  to  produce  the  same  results  as  those  where  a  large  gain  term  is  used.  The 
advantage,  though,  is  that  this  approach  does  not  tend  to  cause  wild  oscillations. 
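
The relationship between a gain term and larger weight and bias values can be illustrated numerically. In the sketch below, the helper names and the specific values (input weight 1, bias 0.75, a factor of 100) are assumptions chosen to match the 2-dimensional problem; the point is only that multiplying a hidden PE's weight and bias by a factor gives the same output as applying that factor as a gain.

```python
import math

def logistic(z, gain=1.0):
    """Logistic function with a gain term, written to avoid overflow for large |gain * z|."""
    t = gain * z
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def hidden_output(x, weight, bias, gain=1.0):
    """Output of one hidden PE for a scalar input x."""
    return logistic(weight * x + bias, gain)
```

For example, `hidden_output(x, 1.0, 0.75, gain=100.0)` and `hidden_output(x, 100.0, 75.0)` agree up to rounding, and both switch sharply near x = -0.75.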


Figure  2:  An  Ideal  2D  Solution  Using  the  Logistic  Function 


5.  Fuzzy  Encoded  BPNs 

In the experiments, four triangular fuzzy sets were selected at intervals of [-1, -0.5], [-0.5, 0], [0, 0.5], and [0.5, 1],
respectively. The fuzzy membership functions are:

$$f_1(x) = 0 \vee (1 - 2|x + 0.75|), \quad f_2(x) = 0 \vee (1 - 2|x + 0.25|), \quad f_3(x) = 0 \vee (1 - 2|x - 0.25|), \quad f_4(x) = 0 \vee (1 - 2|x - 0.75|) \tag{6}$$

where x is the input, and $\vee$ and $\wedge$ are the max and min operators, respectively. Figure 3 shows the architecture of a
fuzzy encoded BPN comparable to the non-encoded BPN shown in figure 2. Additional runs were made using 8
triangular fuzzy sets for each input value (see section 6). It is fairly straightforward to derive a formula to generate
a collection of fuzzy membership functions. First, select the number of fuzzy sets, n, that are to be used. Let $l_i$ be
the left boundary and $r_i$ the right boundary of the i-th fuzzy set. Let b be the boundary value at the intersection of the
fuzzy sets. For simplicity, b is constant for each intersection. Let w be the width of the top of the trapezoid of the
fuzzy sets. Finally, let x be the non-encoded input value. Then,

$$f_i(x) = 1 \wedge \left(0 \vee \left(1 + w - 2\,\frac{1 - b}{r_i - l_i}\,\bigl|x - 0.5(l_i + r_i)\bigr|\right)\right) \tag{7}$$

It should be noted that if w=0 then the $f_i$'s correspond to triangular fuzzy sets (see figure 4).
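
A minimal Python sketch of the four triangular membership functions in equation (6) follows. The names `CENTERS`, `fuzzy_encode`, and `encode_vector` are hypothetical; the centers and the slope of 2 are read directly from equation (6).

```python
CENTERS = (-0.75, -0.25, 0.25, 0.75)  # midpoints of the four intervals on [-1, 1]

def fuzzy_encode(x, centers=CENTERS, slope=2.0):
    """Degrees of membership of x in each triangular fuzzy set: max(0, 1 - slope * |x - c|)."""
    return [max(0.0, 1.0 - slope * abs(x - c)) for c in centers]

def encode_vector(point):
    """Fuzzy-encode every coordinate and concatenate the results (4 values per input)."""
    encoded = []
    for x in point:
        encoded.extend(fuzzy_encode(x))
    return encoded
```

An n-dimensional input therefore becomes a 4n-dimensional input to the BPN, which is why the fuzzy encoded network has more connections than its non-encoded counterpart.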


Figure  3:  A  BPN  with  Fuzzy  Encoding 


When  b  is  at  least  0.5  then  there  exists  a  strict  1-1  correspondence  between  the  fuzzy  encoding  and  the  original 
input value. Since a particular fuzzy encoding can be produced by only one input value, the fuzzy encoding of the
data does not change the nature of the problem. If b is less than 0.5 then we have a 1-many correspondence and
the  information  content  of  the  fuzzy  encoding  is  reduced  and  hence  the  nature  of  the  problem  is  changed. 
Furthermore,  because  of  the  relationship  across  each  fuzzy  set,  the  encoding  does  not  introduce  any  extra  degrees 
of  freedom  into  the  problem.  As  a  matter  of  fact,  there  are  situations  when  the  dimensionality  of  our  specific 
problem  can  be  reduced.  Also,  even  though  more  connections  are  introduced  into  the  BPN  than  with  the  associated 
NE  experiments,  no  additional  data  are  required  to  train  it. 
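
The effect of the boundary value b on the correspondence between inputs and encodings can be explored with a small experiment. The sketch below is an illustration under stated assumptions: it keeps the four triangle centers used in the experiments and varies only the slope, so that a slope of 2 gives b = 0.5 and a slope of 4 gives b = 0; the function names and the rounding to 9 decimals (to sidestep floating-point noise) are hypothetical choices.

```python
CENTERS = (-0.75, -0.25, 0.25, 0.75)

def encode(x, slope):
    """Triangular fuzzy encoding of x, rounded so equal codes compare equal."""
    return tuple(round(max(0.0, 1.0 - slope * abs(x - c)), 9) for c in CENTERS)

def boundary_value(slope, half_width=0.25):
    """Membership value b at the point where neighbouring fuzzy sets intersect."""
    return max(0.0, 1.0 - slope * half_width)

def distinct_codes(slope, steps=10):
    """Count distinct encodings over the 0.1-spaced grid on [-1, 1]."""
    grid = [i / steps for i in range(-steps, steps + 1)]
    return len({encode(x, slope) for x in grid}), len(grid)
```

With b = 0.5 every grid point receives its own encoding, whereas with b = 0 several inputs (for instance -0.5 and 0) collapse onto the all-zero code, so information is lost.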


Figure 4: Fuzzy Set Construction

6. Experimental Details

The  generated  data  were  neither  scaled  nor  normalized.  For  each  specific  n-dimensional  problem,  one  hidden 
layer  was  used  that  contained  2n  PEs.  After  some  initial  trials,  the  number  of  iterations  was  fixed  for  each  set  of 
experiments  in  order  to  more  accurately  compare  the  performance  of  a  BPN  using  non-encoded  (NE)  data  versus 
the  corresponding  BPN  using  fuzzy  encoded  (FE)  data. 

The  data  range  for  the  classification  problem  is  [-1, 1]  and  is  discretized  in  intervals  of  0.1.  Apart  from  ensuring 
that  vectors  were  randomly  selected  from  the  entire  pool,  the  overriding  constraint  was  to  ensure  that  there  was  an 
equal number of class 0 and class 1 vectors in the training sets. For each 2-, 3-, 4-, and 20-dimensional case, 100
training and testing sets were generated in order to provide a more statistically accurate set of observations. Each
set was then fuzzy-encoded and paired with its corresponding non-encoded set. For each experiment, a fixed
number of iterations was used. After the training phase stopped, the test sets were run through the BPNs to
determine  how  well  they  performed.  The  weights  were  also  recorded  for  subsequent  analysis.  For  purposes  of  this 
discussion,  some  representative  experiment  pairs  were  selected  from  the  2-dimensional  cases. 
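
One way such balanced training sets might be generated is sketched below. The rejection-sampling approach, the function names, the seed, and the set size are all assumptions; the text specifies only the data range, the discretization in steps of 0.1, random selection, and equal class counts.

```python
import random

def label(point, bound=0.75):
    """Class 1 if every coordinate lies strictly inside (-bound, bound), else class 0."""
    return 1 if all(-bound < x < bound for x in point) else 0

def balanced_training_set(n_dims, per_class, seed=0):
    """Draw vectors from the 0.1-spaced grid on [-1, 1] until each class has per_class members."""
    values = [i / 10 for i in range(-10, 11)]
    rng = random.Random(seed)
    chosen = {0: [], 1: []}
    while len(chosen[0]) < per_class or len(chosen[1]) < per_class:
        point = tuple(rng.choice(values) for _ in range(n_dims))
        cls = label(point)
        if len(chosen[cls]) < per_class:
            chosen[cls].append(point)
    data = [(p, c) for c in (0, 1) for p in chosen[c]]
    rng.shuffle(data)  # mix the two classes before presentation to the network
    return data
```

In higher dimensions class 1 becomes rare under uniform sampling (each coordinate falls inside the bounds for 15 of the 21 grid values), which is one reason an explicit balancing step is needed.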

In the 2-dimensional case, the NE version of experiment 87 (figure 5a), which yielded perfect classifications, is
very similar in structure to the BPN found in figure 3. That is, the relative magnitudes are similar and the signs
identical for each respective weight and bias value. This suggests that each hidden PE corresponds to a unique and
significant hyperplane. The NE version of experiment 31 (figure 5b) produced an accuracy rate of 86%. Note that
the PE, H4 (shaded), contributed little to the final outcome. In this case only three hyperplanes are used thereby
degrading overall performance. The NE version of experiment 23 produced poor results. This is to be expected
since three of the hidden PEs duplicate the functionality of the remaining hidden PE and this implies that only one
hyperplane is used. In the FE versions of all the experiments, perfect results were achieved. The structures of the
corresponding  BPNs  suggest  that  the  information  content  is  more  uniformly  distributed  through  each  BPN. 

Tables 1i-iv list the overall classification rates (averaged over 100 runs) and the iteration count for several
different  experiments.  In  all  cases  the  FE  BPNs  that  used  four  fuzzy  sets  attained  their  classification  rates  with  an 
iteration  count  of  roughly  an  order  of  magnitude  less  than  their  NE  counterparts.  Moreover,  when  eight  fuzzy  sets 
were  used  an  additional  order  of  magnitude  reduction  in  the  number  of  iterations  was  achieved.  These  significant 
reductions  do  not  precisely  translate  to  corresponding  increases  in  speed  because  there  are  roughly  4  times  the 
number  of  computations  that  have  to  be  performed  for  the  FE  BPNs  using  four  fuzzy  sets  (8  times  for  the  FE  BPNs 
using 8 fuzzy sets). Nevertheless, taking this fact into account, the FE BPNs' performance was still many times
better.  It  should  also  be  noted  that  when  8  fuzzy  sets  were  used  the  FE  BPNs  were  somewhat  sensitive  to 
overtraining.  That  is,  as  the  iteration  count  increased,  their  performance  with  respect  to  classification  success  was 
slightly  degraded. 

Table 1i clearly indicates that the FE BPNs outperformed their NE BPN counterparts for the 2-, 3-, 4-, and
20-dimensional cases.

Table 1ii lists performance results when varying amounts of gaussian noise were added to the first coordinate of
the 2-dimensional data sets. The FE BPNs produced comparable or more accurate classifications with far fewer
iterations. It should also be noted however that NE BPNs tended to produce better results than their noise-free
counterparts. This suggests that the introduction of noise is indeed a useful enhancement to BPNs.

Figure 5: a) NE BPN with Four Hyperplanes b) NE BPN with Three Hyperplanes

The  distribution  of  the  training  data  in  all  of  the  previous  experiments  was  uniform.  Additional  experiments 
were  run  for  the  2-dimensional  case  to  determine  how  well  the  two  types  of  networks  performed  if  the  training  data 
were  not  uniformly  distributed.  Training  data  were  carefully  reselected  to  ensure  non-uniform  distributions;  two 
distinct bimodal distributions and two distinct skewed distributions. Results in table 1iii indicate that FE BPNs
again consistently outperformed NE BPNs and with far fewer iterations.

Table 1: Classification Results (as percentages) averaged over 100 runs

(FE-4=fuzzy-encoded data using 4 fuzzy sets, NE=non-encoded data,
FE-8=fuzzy-encoded data using 8 fuzzy sets, Iters=number of iterations (x1,000))

7.  A  Real-World  Problem 

One-dimensional  magnetic  resonance  (MR)  spectra  were  obtained  at  360  MHz  for  25  thyroid  biopsies:  16 
papillary carcinomas and 9 normal. Two spectral regions were analyzed: the main lipid CH2 and CH3 peaks,
0.64-2.59  ppm;  and  the  choline-like  species,  2.59-3.41  ppm.  Analysis  was  based  on  170  input  points  for  the 
choline  region  and  400  input  points  for  the  lipid  region.  It  has  been  demonstrated  in  [4]  that  an  ANN  can  be 
constructed  that  produces  a  robust  classification  of  thyroid  biopsies  given  their  MR  spectra.  The  inputs  to  the  ANN 
were  the  ten  best  principal  components  of  the  original  data  that  accounted  for  97%  of  the  total  variance.  In  this 
paper,  BPNs  using  the  original  spectral  regions  are  used  without  any  preprocessing  (principal  component  analysis) 
and  compared  with  BPNs  using  the  corresponding  FE  spectral  regions.  Twenty  experiments  were  run  for  each  case 
described  below.  Unlike  the  results  discussed  previously  that  were  based  solely  on  the  test  data,  the  average 
performance  results  listed  in  tables  2iv-v  are  based  on  all  of  the  data  (due  to  the  scarcity  of  the  data). 

Four  fuzzy  sets  were  constructed  for  each  input  coordinate  and  the  FE  data  were  generated  (680  and  1600  input 
points for the choline and lipid regions, respectively). In order to effect uniform coverage, quartiles were computed
for each coordinate and the fuzzy sets were constructed around them (see figure 6). The intersection, b, was set to
0.5 for all sets. Subsequently, eight fuzzy sets were constructed by dividing each quartile in half. Table 2iv lists the
performance results. Once again FE BPNs outperformed their NE counterparts. What is particularly surprising is
the rate of convergence for the FE BPNs (for instance, the NE BPNs used to classify the lipid regions are 800 times
slower than the corresponding FE BPNs).


Figure 6: Fuzzy Sets for Magnetic Resonance Data

Finally,  comparisons  were  made  using  BPNs  with  some  enhancements:  momentum  term;  modulated  learning; 
hyperbolic  tangent  function  instead  of  the  logistic  function;  and  data  scaling.  In  this  case,  the  FE  BPNs  using  four 
fuzzy  sets  performed  as  well  as  their  NE  BPN  counterparts  for  the  choline  region  but  slightly  poorer  results  were 
obtained  for  the  lipid  region  (table  2v).  Although  convergence  still  occurred  much  more  quickly  with  the  FE  BPNs, 
the  NE  BPNs  converged  approximately  twice  as  quickly  with  enhancements  as  without,  whereas  the  FE  BPNs 
converged  roughly  3-5  times  more  slowly.  Moreover,  when  8  fuzzy  sets  were  used,  the  overall  classification  rates 
were  significantly  poorer.  Since  data  scaling  occurred  after  the  data  were  fuzzy  encoded,  the  information  content  of 
the  FE  data  may  have  actually  changed,  thereby  affecting  the  nature  of  the  problem.  It  was  noted  that  when  at  least 
one of the BPN enhancements was deactivated, the FE BPNs' performance results approached those found in the FE
BPNs  without  any  enhancements.  This  sensitivity  of  FE  data  to  conventional  BPN  enhancements  warrants  further 
study. 


8.  Future  Activities 

A  number  of  research  activities  need  to  be  pursued  to  further  test  the  effectiveness  of  fuzzy  encoding,  not  the 
least  of  which  is  further  experimentation  employing  “real-world”  data.  Trapezoidal  fuzzy  sets  may  be  used  to 
determine  if  they  are  of  any  additional  benefit.  The  1-1  correspondence  is  lost  and  this  will  affect  the  information 
content  of  the  input  values  but  the  resulting  BPN  may  become  more  robust.  Further  analysis  is  required  concerning 
the  sensitivity  of  FE  data  to  BPN  enhancements.  Methods,  other  than  uniform  coverage  per  input  unit,  need  to  be 
examined for the selection of the type and number of fuzzy sets. For example, a clustering method such as fuzzy
c-means [5] or Kohonen self-organizing maps [6] may be used to intervalize the data.

This  paper  has  demonstrated  the  efficacy  of  fuzzy  encoding  input  data  for  artificial  neural  networks  that  employ 
the  back-propagation  algorithm.  Compared  to  their  NE  counterparts,  FE  BPNs  consistently  produced  superior 
classification results with dramatically improved rates of convergence. Additional areas of inquiry need to be
examined, especially employing “real-world” data, but the initial results are extremely promising. In particular,
since the volume of MR spectral data used for clinical diagnosis is growing rapidly, a variety of multivariate
techniques  such  as  ANNs  need  to  be  used  in  order  to  quickly  and  accurately  classify  them.  Fuzzy  encoding  should 
be  considered  as  another  tool  in  this  arsenal. 


9.  References 

1. Rumelhart, D. E. and McClelland, J. L., Parallel Distributed Processing, vol. 1, MIT Press, Mass., 1986.

2. Zadeh, L. A., “Outline of a new approach to the analysis of complex systems and decision processes”, IEEE
Transactions on Systems, Man and Cybernetics, SMC-1, 1973, pp. 28-44.

3. Lippmann, R. P., “An introduction to computing with neural networks”, IEEE ASSP Mag., April 1987, pp. 4-22.

4. Somorjai, R. L., Pizzi, N., Nikulin, A., Jackson, R., Mountford, C. E., Russell, P., Lean, C. L., Debridge, L.,
Smith, I. C. P., “Thyroid Neoplasms: Classification by Means of Consensus Multivariate Analysis of 1H MR
Spectra”, 12th Annual Scientific Meeting of the Society of Magnetic Resonance in Medicine, New York, 1993.

5. Bezdek, J. C., Ehrlich, R., and Full, W., “FCM: the fuzzy c-means clustering algorithm”, Computers and
Geosciences, vol. 10, 1984, pp. 191-203.

6. Kohonen, T., Self-Organization and Associative Memory, Springer-Verlag, New York, 1989.




Improving  Generalization  with  Symmetry  Constraints 

by 


N.  Scott  Cardell,  Wayne  H.  Joerding,  and  Ying  Li 
Washington  State  University,  Pullman,  WA  99164 

Abstract¹

This  paper  presents  research  on  the  benefits  of  using  a  priori  information  about  the  symmetry 
of  cross-partial  derivatives  to  improve  generalization.  We  show  how  to  impose  the  symmetry 
constraint  on  a  global  training  algorithm  and  demonstrate  its  efficacious  use  with  a  problem 
in  economics. 


1.  Introduction 

This paper presents preliminary results from our research into imposing a priori information
on feedforward neural networks. We take as an example the imposition of symmetry
constraints suitable when using a feedforward network to approximate a system of nonlinear
equations derived as the gradient of some known or unknown function. This problem can
arise in many fields. For example, in geology detection of magnetic anomalies depends on
the gradient of a gravitational potential function. In economics, the condition for profit
maximization sets the gradient of the production function equal to the real input prices. In
electrical engineering, the non-linear behavior of a MOSFET device depends on the gradient
of the device response function with respect to drain and gate source voltages. In each
of these cases observations on the gradient of a non-linear response function can represent
important, or even the only, information about the phenomena of interest, such as in the
magnetic anomaly example.

The universal approximation capabilities of feedforward networks make them good candidates
for a semi-nonparametric approach to modeling non-linear functions, but traditional
implementations ignore a priori information about the problem implied by the symmetry of
cross-partial derivatives. In this paper, we show how to impose symmetry constraints and
demonstrate their usefulness in an example taken from economics.


2.  Symmetry  in  gradient  vector  equations 

Let $\psi^* : \mathbb{R}^{k_0} \to \mathbb{R}$ represent a twice differentiable function of $k_0$ inputs, and $\Psi^*(x) = \nabla\psi^*(x)$
its $k_0$-dimensional gradient vector. If we were to observe a sample $(o_n, x_n)$, where
$o_n = \psi^*(x_n) + e_n$, $n = 1, \ldots, N$, $e_n$ a mean zero noise term, then we could use the data to train a
network to approximate the unknown function and its derivatives on a compact set (see, for
example, Hornik, Stinchcombe, and White (1989, 1990)). Sometimes, however, we do not
observe a number $o_n$, but instead observe a vector $y_n = \Psi^*(x_n) + e_n$ where $e_n$ represents a
vector of mean zero noise terms. In other cases we observe both $o_n$ and $y_n$.

¹This research was partially funded by National Science Foundation Grant No. SES-9022773.

In either of these cases, approximating the unknown response functions $\psi^*$ and/or $\Psi^*$ can
benefit from using the a priori information that the Hessian matrix for $\psi^*(x)$, defined by the
$k_0 \times k_0$ Jacobian $\nabla\Psi^*(x)$, must be a $k_0 \times k_0$ symmetric matrix. In other words, the symmetry
of cross-partial derivatives defines a property that a network approximation of $\Psi^*(x)$ should
also satisfy.
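
As a small worked illustration of this symmetry property, consider the function $\psi(x) = x_1^2 x_2$. This function is a hypothetical choice made for the example and is not taken from the paper.

```latex
\[
\Psi(x) = \nabla\psi(x) =
\begin{pmatrix} 2 x_1 x_2 \\ x_1^2 \end{pmatrix},
\qquad
\nabla\Psi(x) =
\begin{pmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & 0 \end{pmatrix}.
\]
```

The two off-diagonal entries agree, $\partial\Psi_1/\partial x_2 = \partial\Psi_2/\partial x_1 = 2x_1$, which is the equality of cross-partial derivatives that a network approximating a gradient system should reproduce.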

In this paper we consider using a single hidden layer feedforward network to approximate
$\Psi^*$ when one only observes $y_n$.* Let $\Psi(x)$ represent a feedforward network with $k_0$ inputs,
$k_2 = k_0$ outputs, and connection weights $W$. We seek to approximate $\Psi^* : X \to \mathbb{R}^{k_0}$ for
some set $X \subset \mathbb{R}^{k_0}$ using the network $\Psi$. The above reasoning demonstrates that we should
require $\Psi(x)$ to satisfy symmetry for all $x \in X$. We define $\Psi(x) = W_1 F(W_0 x)$ as a single
hidden layer network with $k_1$ hidden units and connection weights $W = (W_0, W_1)$, where
the $W_i$'s represent $k_{i+1} \times k_i$ weight matrices, $F(W_0 x) = (f(w_{0,1}x), \ldots, f(w_{0,k_1}x))^T$, a vector
of activation functions, and $w_{0,i}$ represents the $i$-th row of $W_0$. Let $w_{k,\ell,i}$ represent the $(\ell, i)$
element of $W_k$, let $f'$ denote the derivative of $f$, and let $f'_\ell = f'(w_{0,\ell}x)$ for some $x \in X$. Then

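$$\begin{aligned}
\nabla\Psi(x) &= F'(W_0 x)\, W_1^T, \quad \text{where } F' = \frac{dF}{dx} \\
&= \left(f'_1 w_{0,1}^T \;\; f'_2 w_{0,2}^T \;\; \cdots \;\; f'_{k_1} w_{0,k_1}^T\right) W_1^T \\
&= W_0^T \begin{pmatrix} f'_1 & 0 & \cdots & 0 \\ 0 & f'_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & f'_{k_1} \end{pmatrix} W_1^T
\end{aligned} \tag{1}$$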

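Let $\psi_{ij}(x)$ define the $(i, j)$ term of $\nabla\Psi(x)$. Then

$$\psi_{ij}(x) = \sum_{\ell=1}^{k_1} f'(w_{0,\ell}x)\, w_{0,\ell,i}\, w_{1,j,\ell} \tag{2}$$

Therefore symmetry requires that

$$\psi_{ij}(x) - \psi_{ji}(x) = \sum_{\ell=1}^{k_1} f'(w_{0,\ell}x)\,\bigl(w_{0,\ell,i}\, w_{1,j,\ell} - w_{0,\ell,j}\, w_{1,i,\ell}\bigr) = 0 \tag{3}$$

for $i = 1, \ldots, k_0 - 1$ and $j = i + 1, \ldots, k_0$. We can express the constraints defined
by (3) more compactly by

$$u_{ij}^T F'(W_0 x) \equiv 0 \quad \text{for all } x \in X, \quad i = 1, \ldots, k_0 - 1, \; j = i + 1, \ldots, k_0 \tag{4}$$

where here $F'(W_0 x) = (f'(w_{0,1}x), \ldots, f'(w_{0,k_1}x))^T$ and $u_{ij}$ represents a $k_1 \times 1$ vector of the
$(w_{0,\ell,i}\, w_{1,j,\ell} - w_{0,\ell,j}\, w_{1,i,\ell})$ terms from (3).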


*We defer the more general problem in which one observes both $o_n$ and $y_n$ to future work.


Cardell, Joerding, and Li (1993) show that under fairly general conditions (4) can be
satisfied if and only if $u_{ij} = 0$. From the definition of $u_{ij}$ we see that this requires

$$w_{0,\ell,i}\, w_{1,j,\ell} = w_{0,\ell,j}\, w_{1,i,\ell} \quad \text{for } \ell = 1, \ldots, k_1, \; i = 1, \ldots, k_0 - 1, \; j = i + 1, \ldots, k_0. \tag{5}$$

Or, alternatively,

$$w_{1,i,\ell} = \beta_\ell\, w_{0,\ell,i} \quad \text{for } \ell = 1, \ldots, k_1, \; i = 1, \ldots, k_0 \tag{6}$$

for some $k_1 \times 1$ vector of constants $\beta = (\beta_1, \ldots, \beta_{k_1})^T$.
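
The effect of the constraint in (6) can be checked numerically. The sketch below reads (6) as tying each output weight to the matching input weight through one constant per hidden unit, and it assumes a logistic activation (the paper does not fix a particular $f$); all function names and test weights are hypothetical.

```python
import math

def logistic_prime(z):
    """Derivative of the logistic function (assumed activation)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def jacobian(W0, W1, x):
    """Entry (i, j): sum over hidden units l of f'(w0_l . x) * W0[l][i] * W1[j][l]."""
    k1, k0, k2 = len(W0), len(x), len(W1)
    fprime = [logistic_prime(sum(w * xi for w, xi in zip(W0[l], x))) for l in range(k1)]
    return [[sum(fprime[l] * W0[l][i] * W1[j][l] for l in range(k1)) for j in range(k2)]
            for i in range(k0)]

def constrained_W1(W0, beta):
    """Output weights tied to input weights: W1[i][l] = beta[l] * W0[l][i]."""
    k1, k0 = len(W0), len(W0[0])
    return [[beta[l] * W0[l][i] for l in range(k1)] for i in range(k0)]

def is_symmetric(M, tol=1e-12):
    """True if the square matrix M equals its transpose up to tol."""
    return all(abs(M[i][j] - M[j][i]) <= tol for i in range(len(M)) for j in range(len(M)))
```

Under the constraint each hidden unit contributes a multiple of an outer product of its own input weight vector, so the Jacobian is symmetric for every input, whereas generic output weights usually break the symmetry.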

The universal approximation capabilities of feedforward networks, as described in Carroll
and Dickinson (1989), Cybenko (1989), Hornik et al. (1989, 1990), and Ito (1991, 1992),
explain much of their usefulness. Thus, we do not want to lose these capabilities when
imposing the symmetry constraints described in (6). The symmetry constraints require
satisfying equality conditions, and so pose somewhat more danger of reducing the universal
approximation capability than do inequality constraints. (See Gallant Gallant (1982), p307,
for an example using inequality constraints.) This derives from the reduced dimension of the
function space that satisfies the symmetry constraint. That is, because functions satisfying
equality constraints occupy a lower dimension subspace of the unconstrained function space,
there may not exist a network that satisfies the symmetry constraint and comes arbitrarily
close to any function with a symmetric Hessian matrix. Fortunately, it turns out that networks
satisfying the kind of constraints defined in (6) possess the same type of approximation
capability described in Hornik et al. (1989, 1990); see Cardell et al. (1993).

Finally, we note that one can use results in Cardell et al. (1993) with the results in
White (1990) to show that a constrained network that minimizes the sum of squared errors
converges consistently to the gradient system $\Psi^*(x)$. Thus, there exist appropriate growth
rates for the number of hidden units to ensure that trained networks converge almost surely
to the true gradient system.


3.  Training 

Training seeks values for the weights that minimize the sum-of-squared errors
$SS = \sum_{n=1}^{N} (y_n - \Psi(x_n))^T (y_n - \Psi(x_n))$ subject to the constraints (6). The constraints in (6) provide a
straightforward extension for many training methods but especially so for hybrid methods such as
described in Li, Joerding, and Genz (1993) or Webb and Lowe (1988). At each iteration
these hybrid methods update the $W_0$ matrix and then solve $k_2$ systems of overidentified linear
equations, $\{y_{i,n} = w_{1,i} F(W_0 x_n),\ n = 1, \ldots, N\}$, $i = 1, \ldots, k_2$, to compute the $W_1$ weights
given $W_0$.

We  can  take  the  same  approach  to  the  constrained  problem  by  altering  the  nature  of  the 
linear least squares sub-problem. Specifically, we solve a system of linear equations with the
typical equation

$$y_{i,n} = (\beta^T, \theta^T)\,\bigl(w_{0,1,i}\, f(w_{0,1}x_n + \mu_1), \ldots, w_{0,k_1,i}\, f(w_{0,k_1}x_n + \mu_{k_1}), I_1(i), \ldots, I_{k_2}(i)\bigr)^T \tag{7}$$




where $n = 1, \ldots, N$, $i = 1, \ldots, k_2$, and $\mu_\ell$ represents a bias parameter for the $\ell$-th hidden
unit and $\theta = (\theta_1, \ldots, \theta_{k_2})^T$ represents bias parameters for each of the output units, and
$I_1(i), \ldots, I_{k_2}(i)$ form a vector of indicator variables such that $I_h(i) = 1$ if $i = h$, 0 otherwise.
Thus, instead of having $k_2$ systems of equations with $N$ equations each and a total of
$k_1(k_0 + 1) + k_2(k_1 + 1)$ parameters, the constrained sub-problem has a single system of $k_2 \times N$ equations
and $k_1(k_0 + 1) + k_1 + k_2$ parameters. Since computation time in the sub-problem increases
as the square of the number of parameters, each iteration of the constrained algorithm takes
more time than the unconstrained algorithm.
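
To make the structure of the constrained sub-problem concrete, the following sketch builds the regressors that multiply $(\beta, \theta)$ for one output and one observation, and tallies the two parameter counts quoted above. The logistic activation, the function names, and the numeric values are assumptions for illustration only; solving the resulting least squares system is left out.

```python
import math

def logistic(z):
    """Assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def design_row(W0, mu, x, i, k2):
    """Regressors for output i: one weighted hidden activation per hidden unit,
    followed by k2 indicator variables selecting the bias of output i."""
    hidden = [W0[l][i] * logistic(sum(w * xj for w, xj in zip(W0[l], x)) + mu[l])
              for l in range(len(W0))]
    indicators = [1.0 if h == i else 0.0 for h in range(k2)]
    return hidden + indicators

def parameter_counts(k0, k1, k2):
    """Total parameters in the unconstrained and constrained networks."""
    unconstrained = k1 * (k0 + 1) + k2 * (k1 + 1)
    constrained = k1 * (k0 + 1) + k1 + k2
    return unconstrained, constrained
```

Stacking `design_row` over all observations and all outputs gives the single system with $k_2 \times N$ equations in the $k_1 + k_2$ unknowns $(\beta, \theta)$ for a given $W_0$ and set of hidden biases.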


4.  Example 

Presumably,  the  use  of  a  priori  information  can  improve  the  ability  of  a  network  to  generalize 
out  of  sample.  (See  Joerding  and  Meador  (1991)  for  more  discussion.)  To  demonstrate  this 
effect  we  take  an  example  from  our  own  field  of  economics.  A  well-known  result  in  economic 
theory  concludes  that  a  profit-maximizing  firm  sets  the  gradient  of  the  production  function 
with  respect  to  factor  inputs  (such  as  capital  and  labor)  equal  to  the  real  input  prices. 
Sometimes  economists  do  not  observe  output  levels  of  a  firm  but  do  observe  input  levels 
and  real  factor  prices.  From  these  data  economists  can  recover  some  characteristics  of  the 
unobserved  production  process  by  relating  input  prices  to  factor  input  levels,  in  other  words, 
by approximating a relationship of the form $y_n = \Psi^*(x_n) + e_n$. To make the best use of
expensive data and to improve generalization the network approximator to $\Psi^*$ should satisfy
the  symmetry  constraints  described  above. 

Of  course,  the  a  priori  information  must  be  correct  for  it  to  benefit  generalization.  We 
also  expect  a  priori  information  to  have  the  most  value  for  small  sample  sizes.  Thus,  for 
our  demonstration  we  generate  a  modest  amount  of  data  from  a  known  data-generating 
process  (DGP)  and  then  seek  to  approximate  that  process  with  various  single  hidden  layer 
feedforward  networks.  Specifically  we  generate  10  different  samples  of  50  observations  each 
from  the  gradient  of  0(x)  =  Xixf  where  x\  represents  capital  and  x^  represents  labor  inputs. 
The input data come from random selection of points in the square $(1,20] \times (1,20]$. We then
use these data to train networks with 2, 4, 6, ..., 28 hidden units, measuring the approximation
error  {AE)  of  the  resulting  networks  by  summing  the  absolute  deviation  of  the  network  from 
the  true  value  at  each  point  on  a  mesh  covering  the  domain  of  the  input  data.  Lines  on  the 
mesh  have  a  .5  spacing. 
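
The approximation error described above can be sketched as follows. The function name `approximation_error`, the callable interface (an array of points in, an array of gradient values out), and the choice to include both end points of the mesh are assumptions made for illustration.

```python
import numpy as np

def approximation_error(network, true_gradient, low=1.0, high=20.0, step=0.5):
    """Sum of absolute deviations between a network and the true gradient,
    evaluated on a square mesh with the given line spacing."""
    axis = np.arange(low, high + step / 2, step)          # 1.0, 1.5, ..., 20.0
    g1, g2 = np.meshgrid(axis, axis)
    points = np.column_stack([g1.ravel(), g2.ravel()])    # one mesh point per row
    return float(np.abs(network(points) - true_gradient(points)).sum())
```

With a spacing of 0.5 the mesh has 39 lines per axis, so each network is compared with the true gradient at 1521 points.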

As noted above, the number of free weights grows more quickly in the unconstrained
networks than in the constrained. Thus, we limit training of unconstrained networks to
26 hidden units. This results in the number of free weights varying from 12 to 132 for
the unconstrained networks and from 10 to 88 for the constrained. We train the network
using a hybrid algorithm based on simulated annealing to find a global minimum of the
sum-of-squares function (see Li et al. (1993)). Taken together we have 130 observations on
the approximation error for unconstrained networks and 140 observations on constrained
networks. We then fit these AE values to quadratic and cubic equations in the number of
hidden units, H, and the number of free parameters, K. Plots of these fitted polynomials
are displayed in Figures 1 and 2. Note, although this is not clear in the picture, the N line
lies everywhere above the Nc line in the right side panel of Figure 2. Also, the N cubic lines
decline for very low numbers of hidden units and free parameters, an anomaly that does not
appear in a quartic polynomial.

The  plots  show  that  the  symmetry  constraint  lowers  the  approximation  error  almost 
uniformly  and  postpones  the  onset  of  degraded  approximation  as  the  number  of  hidden 
units  increases.  The  postponement  effect  shows  up  most  strongly  when  plotted  against  the 
number  of  hidden  units  (left  side  of  figures).  Because  the  number  of  hidden  units  is  the 
only  complexity  control  parameter  in  a  feedforward  network,  this  represents  an  important 
advantage  for  constrained  networks. 


Figure  1:  Quadratic  approximation  of  approximation  error,  AE,  for  unconstrained  N,  and 
constrained,  Nc,  networks  as  a  function  of  the  number  of  hidden  units,  H,  and  the  number 
of  free  parameters  K. 


Figure 2: Cubic approximation of approximation error, AE, for unconstrained, N, and
constrained, Nc, networks as a function of the number of hidden units, H, and the number of
free parameters K.


Abstract

The macro and micro mechanisms of error absorption have been developed for fitting an
under-fitting (skeleton) net in this paper. The theoretic analysis and experimental results
are also given.


1. Introduction

In the past few years, a lot of researchers have paid attention to the improvement of the
generalisation performance of neural networks. For example, Mozer [1] proposed a skeletonization
method at the Colorado University in 1989. He determined the functionality or relevance of individual
hidden and input units using the knowledge in the net. His basic idea was training the net to a certain
criterion, computing the measurement of relevance, and trimming the least relevant units. Yann [2]
developed a method called the 'Optimal Brain Damage' at the AT&T Bell Laboratory in 1990. He
derived nearly optimal schemes for adapting the size of a neural network using information-theoretic
ideas. Weigend [3] introduced the information-theoretic idea of minimum description length into the
weight elimination of the neural network at Stanford University in 1991. Ramachandran [5] removed
the superfluous hidden units based on their information measures, which borrowed from decision tree
induction techniques, at the Texas University in 1992. Zimmermann [6] designed the active and deactivate
test variables for the elimination process at the Siemens Corporation in 1992.

All methods mentioned above skeletonized an over-fitting neural network with different
measurements. There are two problems associated with them. The first is how to choose an initial
(over-fitting) net, which will affect the process of skeletonization. For example, if 100 hidden units
is an optimal solution, the initial net with 300 hidden units will have a longer process of
skeletonization than the initial net with 200 hidden units. In general, it is difficult to choose a
suitable initial net, although there were some papers describing how to choose the optimised structure.
The second is that there are some abysses and many local minima in the training process [7]. If one
chooses a larger initial net, there will be a larger possibility that the training sinks into an abyss
or sticks at a local minimum such that the final net still has lower generalisation performance.

Sethi [8] has developed a new type of skeleton neural network. This net has no redundant
weights for error absorption. If one trains this skeleton net, the generalisation performance will not be
higher because of the absence of redundant weights for error absorption. Our experiments have shown that
if we train this skeleton net with some added redundant weights, the generalisation performance will be
improved to some degree. A skeleton net is called the under-fitting net and the process of adding some
redundant weights on the skeleton net is called the fitting process. In this paper, a detailed analysis
of the error absorption mechanism and the experimental results are presented.

2. The Macro Mechanism of Error Absorption

The mechanism of error absorption is composed of two parts: the macro mechanism and the micro
mechanism. Most back-propagation neural networks use the algorithm proposed by Rumelhart [9] for
their weights updating:


$$\Delta w = -\eta \frac{\partial E}{\partial w}$$

The structure information of a net is described as:

$$a_j = f_j(x_j), \qquad x_j = \sum_i w_{ij} a_i$$

where $A$ is the activate-space of the input units, hidden units, and output units, $W$ is the
weight-space of the weights between the input layer and hidden layer and the weights between the
hidden layer and output layer, and $F$ is the function-space of the hidden units and output units.
The uniform activate function is the sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The deduced weight-updating formulas are:

$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = -\eta \frac{\partial E}{\partial a_j} f_j'(x_j)\, a_i$$

From these weight-updating formulas, one can find out that the error occurring at an output unit
will be absorbed through the weight-space by the units at the lower layers. In fact, the error occurring at
any output unit is synthesised from the units at the lower layers through the weight-space during the
forward-propagation. The weight-updating is a back-error propagation process, which propagates the
errors occurring at the output units back to the units at the lower layers and modifies the weight-space by
the gradient descent method.


Consider a net which has just one output and two layers, as shown in figure 1. In figure 1 (a) the
output error is absorbed by the hidden units $h_1$ and $h_2$ through the weight-space. In figure 1
(b), the output error is absorbed by the hidden units $h_1$, $h_2$, and $h_3$ through the weight-space.
The activate values of the output unit in the two nets at the next time step can be obtained as follows:

=  /!(**',  +  Aw,  )a»,  +(wj  +  Awj )a„ |  - - ^  *  - - - ■  — y- 

*  _y  l+A  ‘f  ^ 

Ur  ***  V  ^ 
a'*^  =  /l(  iv,  +  Ah',  )<j*,  +(1^2+  Aikj  )a*2  +  (  “'j  +  Ah',  )«*,  ]  = 


1 


1  +  e 


7<'*i  ♦**42 +"4} 

«4 


l+*4 


here,  and  are  weight-independent  cuefficienu.  Because  of 

dx 

*»<*«<  I  </(^>0). 

dr 

then 

1*1  1*1 
e*  ■ 

That is, the error in the net of figure 1 (b) will decay faster than in the net of figure 1 (a). It can
be concluded that a larger weight-space can improve the error absorption more than a smaller
weight-space. This was supported by the experimental results shown in figure 2. It is clear from
figure 2 that the redundant net (the net with some redundant weights) converges faster than the one
without redundant weights. This is called the macro mechanism of error absorption.


3. The Micro Mechanism of Error Absorption

If there is more than one output unit in a net, the output unit with the largest error will have
the largest contribution to the net error, and this kind of output unit should be selected for adding a
redundant weight first. This is because, through the redundant weight, the error at this output unit can
be absorbed partially by a unit at the lower layer. After the output unit is chosen, which unit at the lower
layer has to be selected to connect to this output unit should be considered. Consider the situation shown
in figure 3:


Figure 3
Given  die  wanted  activate  values  of  h,  and  h.  are  and  a.  respectively.  If  da.  >da.  .  a.  will  have 

"4  \  \ 

A  A 

larger  deviation  from  ,  and  will  have  smaller  deviation  from  a.  (The  derivatives  on  the  units  at 

the  lower  layer  can  be  obtained  by  BP  method).  In  qualitative  view,  has  less  ability  to  let  the  error  on  x 
parasite  in  it  or  has  less  ability  to  absorb  the  error  on  x  because  its  error  is  already  larger.  In  quaniituive 
view,  we  can  deduce  a  dnivation  on  x  : 


k.  *e  ^ 


)= — - V — 

(Uf  *e  ^  ^  ^ 


Given  the  approaimate  conditiun  that  there  is  no  big  difference  among  s  and  among  a  s.  Because  of 


da.  >da.  ,  then. 

*4  *» 

du';‘{^)<du‘/‘(h4). 

That is, connecting to the unit with the smallest derivative at the lower layer will result in better error
absorption.


4. The Performance Measurements

In this paper, two important measurements used to measure the performance were:
(a) Training Error:


7»f  = 

A 

where $\hat{r}$ is the target output vectors of the training patterns and $r$ is the actual output
vectors responding to the input vectors of the training patterns;

(b) Generalisation Performance:

$$GP = \left(1 - \frac{\text{failures}}{\text{total tests}}\right) \times 100\%$$

where the 'failures' is the number of decision failures when a trained net is used for diagnosis or
reasoning.
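
As a small worked illustration of measure (b), the percentage can be computed as below. The function name and the argument check are assumptions added for this sketch.

```python
def generalisation_performance(failures, total_tests):
    """Percentage of test decisions that were not failures."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return (1.0 - failures / total_tests) * 100.0
```

For instance, 12 failures in 50 tests correspond to a generalisation performance of about 76%.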


5. Algorithm

The algorithm is shown as follows:

Step 1. Constructing a skeleton net with expert knowledge;

Step 2. Preparing training data and test data;

Step 3. Determining the optimised training parameters;

Step 4. Training;

Step 5. If the generalisation performance is satisfied, go to step 7, else go on;
Step 6. Applying the error absorption into the net, go to step 4;

Step 7. Stop.
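
The control flow of Steps 4 to 7 can be sketched as follows. This is only an outline: the callables `train`, `evaluate_gp` and `add_redundant_weight`, the threshold `target_gp`, and the safety limit `max_fitting_epochs` are hypothetical names introduced here, and Steps 1 to 3 are assumed to have been carried out before the call.

```python
def fit_skeleton_net(net, train, evaluate_gp, add_redundant_weight,
                     target_gp=90.0, max_fitting_epochs=10):
    """Alternate training and error absorption until the generalisation
    performance is satisfactory. Returns the net and the GP history."""
    history = []
    for _ in range(max_fitting_epochs):   # assumed guard against endless looping
        train(net)                        # Step 4: training
        gp = evaluate_gp(net)             # Step 5: check generalisation performance
        history.append(gp)
        if gp >= target_gp:
            break                         # Step 7: stop
        add_redundant_weight(net)         # Step 6: apply error absorption, back to Step 4
    return net, history
```

Each pass through the loop corresponds to one fitting epoch in the sense used in the experiments.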

6. The Experimental Results

Below, we present three experiments and their simulation results. The first and the last ones are
classification problems. The second one is an optimisation problem. All of them have the same
characteristic in that the initial skeleton net can be obtained by some expert knowledge.

Figure 4 gives the experimental results of partitioning the plane points into two fields; the
problem itself is an XOR problem [8]. Figure 4 (a) presents the simulation result with the error absorption
mechanism. Figure 4 (b) gives the simulation result without this mechanism. The fitting epochs are five
for both of them. Clearly, with the error absorption mechanism, the generalisation performance has
increased up to 90%, as shown in figure 4 (a). But without the error absorption mechanism, the
generalisation has decreased down to 76%. Meanwhile, the training error decays more with the error
absorption mechanism than without it. For example, the training error reaches 0.008 in figure 4 (a),
while it just reaches 0.017 in figure 4 (b). Figure 5 shows the results of finding a minimum value among
some values. Figure 5 (a) and (b) illustrate the simulation results with and without the error absorption
mechanism respectively. The fitting epochs are six for both of them. In figure 5 (a), the total tendency
of generalisation performance has increased up to 98.5%, while in figure 5 (b), the generalisation
performance oscillates greatly. Figure 6 presents the same experiment as figure 4, but with the momentum
factor in the net. Figure 6 (a) and (b) present the simulation results with and without the error
absorption mechanism respectively. Because of the function of momentum, the difference is smaller between
the methods with and without the error absorption mechanism. However, adding a momentum factor means
increasing the computational cost and space cost.


[Plots: panels (a) and (b); horizontal axis: fitting epoch]

Figure  4  The  experimental  results  of  the  plane  point  partitioning 




Figure 5 The experimental results of finding minimum value among three values

Figure 6 The experimental results of the plane point partitioning with the momentum factor

7. Conclusion

This paper has presented a new method for improving the generalisation performance of neural
networks. The mechanism of error absorption was developed for fitting an under-fitting (skeleton) net.
The experimental results have shown that it is very useful for one type of problem, where the initial
skeleton net can be constructed from expert knowledge, and the training technique is used to overcome
the drawback of the expert's incomplete knowledge. With this new method, sinking into the abyss can be
avoided and the generalisation performance improvement time is obviously reduced.


Abstract

We  propose  a  neural  network  system  that  sequentially  obtains  I/O  sample  data.  The  sys¬ 
tem  selects  useful  sample  data  as  training  data,  in  what  we  call  active  data  selection 
(ADS),  and  interpolates  errors  between  training  data  and  the  network  output,  called  sub¬ 
sequent  revision  (SR).  ADS  removes  sample  data  if  doing  so  only  causes  small  errors. 
To  speed  up  ADS,  we  ignore  errors  generated  by  the  network  and  consider  only  those 
from  SR. 

We  found  that  ADS  steadily  decreases  errors  and  that  SR  gives  suitable  output,  even  if 
the  neural  network's  learning  is  still  not  adequate.  Simulation  demonstrated  the  ability  of 
the  network  to  learn  a  sine  function  from  sample  data  distributed  unevenly  in  the  input 
space.


1.  Introduction 

Adaptive I/O systems that interpolate sample data are classified into two typical techniques: storage or
learning. Storage techniques, e.g., the k nearest neighbors method, simply store sample data without
learning until the data is interpolated and output. Although techniques of this type dispense with learning,
memory and processing requirements and the response time tend to increase with the amount of sample
data. Although learning techniques, e.g., the gradient descent method for a layered neural network, express
I/O relationships compactly and shorten the response times, more time is needed for adaptation due to the
increased learning workload.

Combined techniques with sample data selection have also been proposed. In some studies [1, 2, 3],
the output is a superposition of Gaussian functions associated with the sample data and constants or
linear functions. If the system performs well with new sample data, the parameters are updated using the
gradient descent method. If the system performs poorly, the data is used to add new Gaussian functions.
The criteria for selecting useful data are studied in other situations [4, 5]. Oka proposed a system which
chooses the appropriate output of back-propagation or memory-based learning subsystems [6].

We combine a layered neural network, which learns by back-propagation, with storage techniques.
Active data selection (ADS) selects useful training data from sequentially given sample data. The training
data is used for neural network learning and subsequent revisions (SR), which adjust the network output by
interpolating errors between the training data and network output. ADS selects training data based on the
principle that more errors occur when useful data is removed. ADS limits the amount of training data
stored, which prevents memory and processing requirements from increasing. The training data chosen by
ADS steadily decreases errors. SR adjusts the network output to compensate for insufficient network
learning and enable the system to adapt quickly. The single-layered superposition of Gaussian functions
[1, 2, 3] is sufficiently easy to enable sample data to be evaluated for selection. The complex output of a
layered neural network, however, makes it difficult to evaluate sample data. To speed up selection, ADS
considers only SR errors and ignores those of the neural network, which avoids the network learning workload.


2.  System 
2.1  SR 

The proposed system generates an input-related output while adapting itself to sequentially given
sample data (Figure 1). When input vector $x$ is given, adjustment $e(x)$ is added to the neural network
output $f_N(x)$ in the adder to give an adjusted output $f(x)$.

$$f(x) = f_N(x) + e(x) \qquad (1)$$

The three-layered network learns by back-propagation [7] using dataset A, which is stored in the
active-data storage unit. Dataset $A = (P_1, P_2, \ldots, P_M)$ is the set of sample data
$P_i = (x_i, y_i)$ that represents up to $M$ I/O pairs.

The comparison unit generates a set of errors $e_i$ ($i = 1$ to $M$),

$$e_i = y_i - f_N(x_i) \qquad (2)$$

which are the differences between the neural network output $f_N(x_i)$ and the output $y_i$ of data $P_i$.

$$e(x) = e(x; P_1, P_2, \ldots, P_M) = \frac{\sum_{i=1}^{M} e_i W_i(x) \exp(-|x - x_i| / D_o)}{\sum_{i=1}^{M} W_i(x)} \quad (x \neq x_i) \qquad (3)$$

$$e(x) = e_i \quad (x = x_i)$$

$W_i(x)$: Weight of error $e_i$ at $x$

$x$, $x_i$: Input of sample data

$D_o$: Exponential damping coefficient of output


To make the output $f(x)$ equal the sample data $y_i$, the interpolation unit derives the adjustment
$e(x)$ for any input $x$ from the set of output errors $e_i$. The adjustment $e(x)$ is the weighted
average of the errors $e_i$.

The weight $W_i(x)$ is proportional to the inverse distance from the sample data's input $x_i$ to $x$
($|x - x_i|$) with exponential damping (Figure 2).


W,(x)  = 


lx-x,|  2jx-Xy|lX|  -X;-|  I 


(4) 


$D_w$: Exponential damping coefficient of weight


The subjoined multiplication modulates the weight $W_i(x)$ to make the adjustment $e(x)$ smooth
in the neighborhood of the sample data's input. The weight $W_i(x)$ corresponding to sample data $P_i$
is seriously affected by the other sample data $P_j$ when $x_j$ exists between $x_i$ and $x$. The
exponential damping in Eq. (3) suppresses the adjustment $e(x)$ at a distance from any sample data.

This interpolation method is suitable for multidimensional inputs, and follows the sample data
smoothly. Because of the storage technique, the response process and time of the interpolation unit
increase with the size of dataset A, but the unit adapts rapidly.
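
A minimal one-dimensional sketch of the subsequent-revision idea is given below. It is not the exact scheme of this section: the weight is assumed, for illustration only, to be `exp(-d / d_w) / d` with `d = |x - x_i|`, which keeps the inverse-distance and damping behaviour but omits the smoothing product of Eq. (4), and the output damping is assumed to be `exp(-d / d_o)`. All function and parameter names are hypothetical.

```python
import numpy as np

def sr_adjustment(x, xs, errors, d_w=0.10, d_o=0.10):
    """Interpolated adjustment e(x) from stored errors (1-D inputs only)."""
    xs = np.asarray(xs, dtype=float)
    errors = np.asarray(errors, dtype=float)
    dist = np.abs(xs - x)
    exact = dist < 1e-12
    if exact.any():
        # At a stored input the adjustment equals the stored error.
        return float(errors[exact][0])
    weights = np.exp(-dist / d_w) / dist        # assumed simplified weight
    damped = errors * np.exp(-dist / d_o)       # assumed output damping
    # Note: very far from every sample the weights can underflow to zero.
    return float(np.sum(weights * damped) / np.sum(weights))

def revised_output(x, network, xs, ys, d_w=0.10, d_o=0.10):
    """Adjusted output f(x) = f_N(x) + e(x) for a scalar-valued network."""
    errors = [y_i - network(x_i) for x_i, y_i in zip(xs, ys)]
    return network(x) + sr_adjustment(x, xs, errors, d_w, d_o)
```

At a stored input the revised output reproduces the stored target, and away from all stored inputs it falls back to the raw network output.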


2.2 ADS

The active-data storage unit holds dataset $A = (P_1, P_2, \ldots, P_M)$, which contains up to $M$ pairs
of sample data $P_i = (x_i, y_i)$ ($i = 1$ to $M$), for use in the neural network learning and SR. As the
network fetches new data, it removes no-longer-needed data to maintain the size of the dataset. This is
supported by the sleep-data storage unit and dataset S stored in it.

If the sample data does not include noise, we must consider the difference between the output and
the sample data output (selection criterion 1) and the ease with which the output is estimated from
adjacent sample data (selection criterion 2). The squared error $\Delta e_i^2$,

$$\Delta e_i^2 = \{e_i - e(x_i; P_1, P_2, \ldots, P_{i-1}, P_{i+1}, \ldots, P_M)\}^2 \qquad (5)$$

which is the squared difference between the error of the data $P_i$ and the adjustment $e(x)$ without
data $P_i$, was used as the evaluation criterion (Figure 3). This includes the two selection criteria
above. The larger the value, the more useful the data is. ADS is fast, because this does not involve the
workload associated with network learning.
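
The leave-one-out evaluation can be sketched as follows. The interpolation rule is passed in as a callable `interpolate(x, xs, errors)`, a hypothetical interface standing in for the SR interpolation; the function names are likewise illustrative.

```python
def ads_scores(xs, errors, interpolate):
    """Squared difference between each stored error and the adjustment
    interpolated from all the other stored data."""
    xs, errors = list(xs), list(errors)
    scores = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_e = errors[:i] + errors[i + 1:]
        estimate = interpolate(xs[i], rest_x, rest_e)
        scores.append((errors[i] - estimate) ** 2)
    return scores

def least_useful(xs, errors, interpolate):
    """Index of the datum whose removal changes the adjustment least."""
    scores = ads_scores(xs, errors, interpolate)
    return min(range(len(scores)), key=scores.__getitem__)
```

When the store is full, the datum returned by `least_useful` is the natural candidate for removal, since the remaining data already predict its error well.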


Figure  3.  Evaluation  criteria 


3.  Experiment 

Simulation demonstrates that ADS enables the neural network to retain data with a suitable
distribution, even when sample data is unevenly distributed. Our simulation involved learning a
single-input, single-output sine function:

$$y = (1 + \sin(2\pi x))/2 \qquad (6)$$

where input $x$ and output $y$ are in the range (0.0, 1.0].

The simulation used 500 sample data $(x_i, y_i)$ satisfying Eq. (6); 90% of the sample data appeared in
input area A (0.0, 0.5] and the remaining 10% in input area B (0.5, 1.0]. The data distribution in each area
was uniform. As the control, a network system without ADS was tested. The control discarded data on a
FIFO basis.
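
A data set of this kind can be generated as sketched below. The sketch assumes the target function is $y = (1 + \sin(2\pi x))/2$, uses exact 90%/10% counts followed by a random shuffle to form the sequence, and the function name and seed are illustrative.

```python
import numpy as np

def make_samples(n_total=500, frac_a=0.9, seed=0):
    """Unevenly distributed samples of the sine target function."""
    rng = np.random.default_rng(seed)
    n_a = int(round(n_total * frac_a))
    # rng.uniform(0, 0.5) lies in [0, 0.5), so the differences below lie in
    # (0.0, 0.5] for area A and (0.5, 1.0] for area B.
    x_a = 0.5 - rng.uniform(0.0, 0.5, size=n_a)
    x_b = 1.0 - rng.uniform(0.0, 0.5, size=n_total - n_a)
    x = rng.permutation(np.concatenate([x_a, x_b]))   # random presentation order
    y = (1.0 + np.sin(2.0 * np.pi * x)) / 2.0
    return x, y
```

Presenting the samples one at a time in the shuffled order mimics the sequential acquisition used in the simulation.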

The neural network consists of three layers with a 1-6-1 structure. The active-data storage unit holds
up to ten sample data and the sleep-data storage unit up to five. Each time the neural network fetches new
data, 30 network learning iterations follow.


4.  Results 

Squared error $E$, relative to output $f(x)$, is represented by:

$$E = \int \{f(x) - f_T(x)\}^2 \, dx$$

where $f(x)$: Output

$f_T(x)$: True function output

Squared errors of the system with ADS decrease steadily. Without ADS, the errors do not decrease
as much (Figure 4). SR also decreases the errors and so compensates for slow learning in the neural
network. The system outputs were measured based on 500 sample data (Figure 5). Without ADS, large
differences from $f_T(x)$ occur in input area B (0.5, 1.0], where the probability of data appearing is
low. ADS, however, decreases errors in both areas.

Figure 4. Squared errors
$D_w = 0.10$, $D_o = 0.10$

5. Conclusion

We proposed a neural network system which is combined with a storage technique. Active data
selection (ADS) selects useful training data from sequentially given sample data. The training data is
used for neural network learning and subsequent revisions (SR), which adjust the network output by
interpolating errors between the training data and network output.

Simulation demonstrates the abilities of the system. The training data chosen by ADS steadily
decreases the errors. SR adjusts the output to compensate for insufficient network learning and gives an
adequate result. The proposed neural network system responds and adapts rapidly to sequential learning.


Figure 5. Outputs after obtaining 500 sample data
$D_w = 0.10$, $D_o = 0.10$


REFERENCES 

[1] J. C. Platt (1991). “A resource allocating network for function interpolation,” Neural Computation,
3(2), 213-225.

[2] J. C. Platt (1991). “Learning by combining memorization and gradient descent,” in R. P. Lippmann,
J. E. Moody and D. S. Touretzky, eds., Neural Information Processing Systems 3. Morgan Kaufmann.

[3] V. Kadirkamanathan and M. Niranjan (1992). “A function estimation approach to sequential learning with
neural networks,” CUED/F-INFENG/TR.111.

[4] M. Plutowski and H. White (1991). “Active selection of training examples for network learning in
noiseless environments,” Dept. Computer Science, UCSD TR 90-011.

[5] D. J. C. MacKay (1992). “Information-based objective functions for active data selection,” Neural
Computation, 4(4), 590-604.

[6] N. Oka and K. Yoshida (1992). “Combining back-propagation and memory-based learning,” 6th
conference of Japanese Society for Artificial Intelligence, 8-5, 377-380 (in Japanese).

[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986). “Learning internal representations by error
propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1,
D. E. Rumelhart and J. L. McClelland (eds), MIT Press, Cambridge, MA.




INPUT DATA TRANSFORMATION FOR BETTER PATTERN
CLASSIFICATIONS WITH FEWER NEURONS


Yasuhiro Ota
Bogdan Wilamowski
Electrical Engineering Department
University of Wyoming
Laramie, WY 82071, USA


Abstract: 

Ordinary discriminant functions for pattern separations are normally linear. Neural
networks with one-layer architecture can classify only linearly separable patterns, and thus
multilayer neural networks are required for separation of nonlinearly separable patterns.
In this paper, an improved formulation of discriminant functions with fewer neurons is
proposed. This is accomplished by introducing an additional dimension to a set of input
patterns.


I. Introduction:

A pattern is the quantitative description of an object, phenomenon, or event. A
classification of patterns can be spatial or temporal. Examples of the former case are
pictures, video images, and characters. Examples of the latter case include speech signals,
seismograms, and electrocardiograms, which normally involve ordered sequences of data
appearing in time. The goal of pattern classification is to assign a physical object,
phenomenon, or event to one of the prespecified classes. The mechanism of pattern
recognition (classification) in the human brain seems to be almost impossible to reveal.
However, an artificial intelligence classifying system consists of an input transducer which
provides the input pattern data to the feature extractor [1]. Typically, inputs to the feature
extractor are sets of data vectors that belong to a certain category.

Several designs have been presented in the past for classifying patterns using
n-dimensional discriminant functions. The efficient classifiers, in general, are described by
discriminant functions that are nonlinear functions of input patterns [2][3]. As was
described by Marvin Minsky and Seymour Papert, one-layer neural networks have very
limited ability for pattern classifications [4]. They can classify only linearly separable
patterns; therefore, multilayer neural networks are required for separation of nonlinearly
separable patterns. This paper discusses how to reduce the number of neurons with an
effective nonlinear pattern classification. The formulation of the input data transformation
method is described in Section II, and the simulation of a proposed network design is
shown in Section III. Section IV concludes the design and gives suggestions of possible
future studies applicable to this design.

II. Structure of Pattern Classifications


First, the assumption is made that both a set of $n$-dimensional input patterns
$(x_1, x_2, \ldots, x_P)$ and the desired classification for each input pattern
$(d_1, d_2, \ldots, d_P)$ are known. The size $P$ of the pattern set is finite and usually much
larger than the dimensionality $n$ of the pattern space. In many practical cases, it is assumed
that $P$ is much larger than the number of categories (classes) $R$. The goal is to classify input
patterns into $R$ categories. For given input patterns $(x_1, x_2, \ldots, x_P)$ with $R$ categories,
each category of input patterns normally has a center of gravity, and it can be found by

$$x_{CG_R} = \frac{x_1 + x_2 + \cdots + x_k}{k} \qquad (1)$$


where the subscript $CG_R$ stands for the center of gravity for the R-th category with k input
data in that category. Once the centers of gravity for each category are defined, the radii
$r_R$ of circles (or spheres) that enclose all the input data that belong to a certain category
can be found from the following equation:


One more parameter must be defined in order to transform n-dimensional input
data into (n+1) dimensions: the maximum distance of an input point from the origin,
which is used to scale all the input data in transforming them into new (n+1)-dimensional
input arrays. Finally, the following series of equations is utilized for the transformation
of input patterns:

$$
\begin{aligned}
z_1 &= \cos\alpha_1 \\
z_2 &= \sin\alpha_1 \cos\alpha_2 \\
z_3 &= \sin\alpha_1 \sin\alpha_2 \cos\alpha_3 \\
&\;\;\vdots \\
z_n &= \sin\alpha_1 \sin\alpha_2 \cdots \sin\alpha_{n-1} \cos\alpha_n \\
z_{n+1} &= \sin\alpha_1 \sin\alpha_2 \cdots \sin\alpha_{n-1} \sin\alpha_n
\end{aligned}
\qquad (3)
$$


where $z_i$ ($i = 1, 2, \ldots, n+1$) are the coordinates of the new (n+1)-dimensional
transformed input data space and the arguments $\alpha_j$ ($j = 1, 2, \ldots, n$) are defined by

(4)

Notice that the (n+1)-dimensional input data are mapped such that

$$\sum_{i=1}^{n+1} z_i^2 = 1 \qquad (5)$$
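Relation (3) has the form of hyperspherical coordinates, so the transformed points always lie on the unit sphere. The following is a minimal Python sketch of that mapping from angles to coordinates; the function name `angles_to_unit_vector` and the sample angles are illustrative assumptions and do not come from the paper (in particular, the rule that produces the angles from the original data is not reproduced here).

```python
import math

def angles_to_unit_vector(angles):
    """Map n angles (alpha_1..alpha_n) to n+1 coordinates (z_1..z_{n+1}) as in relation (3)."""
    z = []
    sin_product = 1.0  # running product sin(alpha_1) * ... * sin(alpha_{i-1})
    for alpha in angles:
        z.append(sin_product * math.cos(alpha))
        sin_product *= math.sin(alpha)
    z.append(sin_product)  # last coordinate is the product of all sines
    return z

# Hypothetical angles for a two-dimensional pattern (n = 2), giving a 3-D point.
z = angles_to_unit_vector([0.3, 1.1])
squared_norm = sum(c * c for c in z)
print(z, squared_norm)
```

Whatever angles are supplied, the sum of the squared coordinates equals one up to rounding error.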
By employing relations (3) and (4) all the n-dimensional input data including the centers of
gravity of all the categories can be transformed. In order to find discriminant functions
for the separation of patterns, equations of decision surfaces (separation planes) are
required. Notice that the (n+1)-dimensional data at the center of gravity represent the
normal vector of a separation plane; i.e.,


(6) 


Once some point which lies on this separation plane is known, the equation of the plane
can be established. The point which lies on the boundary, $r_R$, of the original n-dimensional
pattern should be mapped onto the edge of the boundary in the (n+1)-dimensional pattern,
and hence this point, $z_{EDGE_R}$, should be used in formulating the equations of the
discriminant functions. Thus, the discriminant functions can be given by

$$n \cdot (z - z_{EDGE_R}) = 0 \qquad (7)$$
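A plane defined by a normal vector and one point on it gives a signed value for any other point, which is what makes it usable as a discriminant. The sketch below evaluates that signed value in Python; the function name, the sign convention (positive on the side of the center of gravity), and the sample vectors are assumptions made for illustration only.

```python
def plane_discriminant(normal, edge_point, z):
    """Signed value of normal . (z - edge_point); zero exactly on the separation plane."""
    return sum(n_i * (z_i - e_i) for n_i, z_i, e_i in zip(normal, z, edge_point))

# Hypothetical transformed data on the unit sphere in three dimensions.
normal = (0.0, 0.0, 1.0)        # transformed center of gravity of one category
edge_point = (0.6, 0.0, 0.8)    # transformed point on the category boundary

inside = plane_discriminant(normal, edge_point, (0.0, 0.0, 1.0))
outside = plane_discriminant(normal, edge_point, (1.0, 0.0, 0.0))
on_plane = plane_discriminant(normal, edge_point, edge_point)
print(inside, outside, on_plane)
```

Under this convention a point of the category yields a positive value and a point outside the enclosing circle yields a negative one.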

The above transformation can be used, not only for the condition with linear separability
of patterns, but also for the case of linearly nonseparable patterns. The analytical weights
for the neuron being activated for the R-th category can then be given as

(8) 


III. Simulation Results


The following simple example will allow the reader to gain better insight into the
discussion of the pattern classification issue. Now consider two-dimensional
patterns (n = 2) with two categories (R = 2). Initially, seventeen patterns were assigned in
two-dimensional pattern space according to their membership in sets as follows:


'si  fioi  ri2i| 
In order to classify the given patterns into two categories with the ordinary decision lines
(discriminant functions), at least four decision lines and a two-layer neural network are
necessary, as shown in Figure 1. On the other hand, only one neuron is necessary to
perform the same function if the proposed design is employed, since it is possible to
separate the two categories with one circle, as shown in Figure 2. Figure 3 illustrates the
given input data after they have been transformed into three-dimensional patterns.

Figure 1. Two-Dimensional Patterns with Ordinary Discriminant Functions and the
Two-Layer Neural Network.

Figure 2. Two-Dimensional Input Patterns with the Improved Model and the
Single-Neuron Network.

Figure 3. Input Patterns After the Transformation.

For  the  given  single-neuron  network,  the  analytical  weight  vector  can  be  found  by  utilizing 
relation  (8).  This  vector  is  given  as 


W.  .  =[0.5305  0.5180  0.6710  -0.8008] 


The popular error back-propagation training algorithm [5][6] (delta training for a
single-layer network) can also be utilized to compute the optimal weight vector. The
performance of the neural network will then be compared with both the analytical and the
delta training weights. Initial weights for training are randomly chosen as in the normal
procedure, and the total output error does converge towards zero as shown in Figure 4.
After the training of the network with 500 iterations, the delta-trained weight vector is
found to be


*[0.4448  0.5034  0.7408  -0.7890] 

Testing of pattern classifications is achieved with the two weight vectors listed above, and
the results are tabulated in Figure 5. Figure 6 illustrates the mesh plots of the actual
input-output nonlinear mappings in the original pattern space.

Figure 4. Total Output Error of the Neural Network with the Delta Training.

Figure 5. Test Results of the Neural Network.

Figure 6. Actual Output Transfer Characteristics (a) Using the Analytical Weight Vector
and (b) Using the Weight Vector with the Delta Training.

As can be seen from this simulation, the proposed design of input data
transformation for pattern classifications works well and is superior to the ordinary pattern
classification designs.

IV. Conclusions

An improved technique of input data transformation for effective pattern
classification with a minimum number of neurons has been presented in this paper. The
simulation results clearly demonstrated that the number of necessary neurons could be
effectively minimized with accurate classifications of input patterns by introducing an
additional dimension (freedom) for the input patterns. The simulation revealed that even
with a simple linearly nonseparable example the number of neurons required was reduced
by using the improved method. In fact, the number of necessary neurons for the example
case used was reduced by four.

Some possible further studies include testing of this design in higher dimensions
with more complicated patterns that have a greater number of classifying categories,
although this input transformation technique can be applied to virtually any pattern
classification problem of any dimension.


References

[1] J. M. Zurada, Introduction to Artificial Neural Systems, West Publishing Company, St. Paul,
MN, 1992.

[2] H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition,
Wiley-Interscience, New York, NY, 1972.

[3] B. Widrow and M. A. Lehr, "30 Years of Adaptive Neural Networks: Perceptron,
Madaline, and Backpropagation," Proceedings of the IEEE, vol. 78, no. 9, pp. 1415-1442, Sept. 1990.

[4] M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA, 1969.

[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer Feedforward Networks Are Universal
Approximators," Neural Networks, vol. 2, pp. 359-366, 1989.

[6] E. D. Karnin, "A Simple Procedure for Pruning Back-Propagation Trained Neural Networks,"
IEEE Trans. on Neural Networks, vol. 1, no. 2, pp. 239-242, 1990.




Fundamentals  of  the  Bootstrap  Based  Analysis  of  Neural  Network's  Accuracy. 


Alex S. Katz 1

Simon Katz 2
Norma Lowe 1

1 Baxter Healthcare Corporation, Edwards CVS Division,
17221  Red  Hill  Ave.  MS-33,  Irvine,  CA  92714 
(714)250-2546 
(714)  250-3579  FAX 

2  University  of  Southern  California,  Geology  Dept.,  Los  Angeles,  CA  90033 


Abstract. 

Neural  networks  and  bootstrap  methodologies  were  combined  into  a  single  system 
(NNB),  creating  a  powerful  predictive  neural  engine  together  with  the  means  of  statistical 
estimation  of  the  accuracy  of  the  prediction  errors.  The  NNB  system  was  tested  on  clinical  data 
obtained  from  a  clinical  trial  of  implanted  artificial  heart  valves.  The  system  correctly  predicted 
78%  of  the  valve  related  deaths  in  the  time  period  of  1981-1991,  for  a  patient  sample  of  789, 
based  solely  on  the  information  available  preoperatively.  Distribution  of  the  prediction  error  and 
its  variation  in  relation  to  selection  of  the  training  set  was  observed  and  analyzed,  based  on  1300 
bootstrap replications of the neural net's cycle of training and testing. Expectation of the
prediction  error  along  with  the  confidence  intervals  and  prediction  intervals  for  the  error  were 
computed. 

Introduction. 

Artificial adaptive neural networks provide a powerful tool that has been used
successfully in image processing, pattern recognition, natural phenomena prediction and signal
processing [1,3,5,6]. The encouraging results of application of the neural techniques to a number
of problems pose a question of estimating the reliability of the neural networks' performance.
Because of the neural systems' nonlinearity and structural complexity, the classical statistical
theory provides little help in analyzing the performance and accuracy of predictions of neural
nets. This gap may be successfully filled using such methods of computational statistics as
the bootstrap.

The bootstrap is a computer-based method of statistical inference for assigning measures
of accuracy to statistical estimates [8,9]. Bootstrap methods can assess accuracy measures such
as biases, prediction errors and confidence intervals [7,8]. The bootstrap algorithm generates a
large number of bootstrap samples, obtained by resampling with replacement from the original
data. For each bootstrap sample a corresponding bootstrap replication of the statistic of interest
is calculated [7-10]. The accuracy measures of interest are then estimated from these bootstrap
replications. For instance, if $s(x)$ is the statistic of interest, and its standard error is to be
estimated, then the bootstrap estimate $se_{boot}$ can be computed according to the following
formula:

$$se_{boot} = \left\{ \sum_{b=1}^{B} \left[ s(x^{*b}) - s(\cdot) \right]^2 / (B-1) \right\}^{1/2} \qquad (1)$$

where

$$s(\cdot) = \sum_{b=1}^{B} s(x^{*b}) / B \qquad (2)$$

B is the number of bootstrap replications, and $x^{*b}$ is a bootstrap resampling of the original data
$x = (x_1, x_2, \ldots, x_n)$.
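Formulas (1) and (2) translate almost directly into code. The following is a minimal Python sketch using only the standard library; the function names, the default of 200 replications, the fixed seed, and the sample data are illustrative assumptions rather than settings from the study.

```python
import random

def mean(values):
    return sum(values) / len(values)

def bootstrap_standard_error(data, statistic, num_replications=200, seed=0):
    """Bootstrap estimate of the standard error of statistic(data), per (1) and (2)."""
    rng = random.Random(seed)
    n = len(data)
    replications = []
    for _ in range(num_replications):
        # Resample n items with replacement from the original data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replications.append(statistic(resample))
    center = sum(replications) / num_replications                      # formula (2)
    spread = sum((r - center) ** 2 for r in replications) / (num_replications - 1)
    return spread ** 0.5                                               # formula (1)

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # hypothetical data
se = bootstrap_standard_error(sample, mean)
print(se)
```

Any statistic that accepts a list can be passed in place of `mean`, which is the main attraction of the method when no closed-form standard error is available.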

Application of the bootstrap methodology provides the means of estimating the standard
error and computing prediction intervals, as well as confidence intervals for the estimated mean
of the error of the neural network's prediction [2]. This approach combines powerful prediction
capabilities, provided by the neural networks, with the ability to estimate the results of
prediction and prediction errors, which in turn enables us to optimize the predictive
mechanism, minimizing the error and maximizing its predictive power. Additionally, this
approach allows for further optimization of the neural network and increasing its predictive
capability and accuracy of prediction.

Bootstrap  estimation  of  prediction  errors. 

Available data $\Omega$ consists of pairs (input, expected output) $(x_i, z_i)$. $\Omega$ is randomly divided
into two non-empty subsets $\Omega_{tr}$ and $\Omega_{ts}$, such that

$$\Omega_{tr} \cap \Omega_{ts} = \emptyset \qquad (3)$$

Bootstrap training and testing samples $\omega_{tr}(j)$ and $\omega_{ts}(j)$ are generated from $\Omega_{tr}$ and $\Omega_{ts}$, where
j is the index of the bootstrap replication.

A neural network is defined by the vector of its parameters u, and the output F of the neural net
is a function of the net's input and parameters: $F = F(u, x)$.

For each bootstrap sample of the training set $\omega_{tr}(j)$ the neural net is initialized and trained to
minimize the approximation error

$$\varepsilon(u, j) = \sum_{(x_i, z_i) \in \omega_{tr}(j)} \| F(u, x_i) - z_i \|^2 \qquad (4)$$

The parameter vector $u_j$ of this neural network is defined by the following characteristic
equation

$$\varepsilon(u_j, j) = \min_{u} \varepsilon(u, j) \qquad (5)$$

The corresponding bootstrap estimate of the neural net's prediction error computed over $\omega_{ts}(j)$
is

$$\xi(u_j) = \sum_{(x_i, z_i) \in \omega_{ts}(j)} \| F(u_j, x_i) - z_i \|^2 \qquad (6)$$

After generating bootstrap samples $\omega_{tr}(j)$ and $\omega_{ts}(j)$ of the training and testing sets for
$j \in [1, J]$, and estimating the network's prediction errors $\xi(u_j)$, random variable statistics may
be computed per (1) and (2) and distribution parameters for $\xi(u_j)$ may be estimated and
analyzed.

The expectation of the prediction error may be estimated as the average of bootstrap estimates for
$\xi(u_j)$:

$$\bar{\xi} = \frac{1}{J} \sum_{j=1}^{J} \xi(u_j) \qquad (7)$$

Standard error of the prediction error may be estimated as

$$se(\xi) = \left\{ \sum_{j=1}^{J} \left[ \xi(u_j) - \bar{\xi} \right]^2 / (J-1) \right\}^{1/2} \qquad (8)$$

Depending on the size of the training set and the empirical distribution of the estimate of the
measure of interest (prediction error $\xi(u_j)$ in our case), confidence intervals for the actual value
of the estimated parameter may be estimated by different methods. In this study the bootstrap
percentile method was used. If $\xi^{(\alpha)}$ is the $100\alpha$-th percentile of J bootstrap replications $\xi(j)$,
$j = 1, \ldots, J$, then the interval of intended coverage $1 - 2\alpha$ is obtained by

$$\left( \xi^{(\alpha)}, \; \xi^{(1-\alpha)} \right) \qquad (9)$$
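The percentile method amounts to sorting the bootstrap replications and reading off two order statistics. The sketch below is one simple way to do that in Python; the function name, the index convention (the same number of replications is cut from each tail), and the sample values are assumptions for illustration, and other percentile conventions differ slightly at small J.

```python
import math

def percentile_interval(replications, alpha):
    """Interval of intended coverage 1 - 2*alpha from sorted bootstrap replications."""
    ordered = sorted(replications)
    count = len(ordered)
    cut = int(math.floor(alpha * count))   # number of replications trimmed from each tail
    lower = ordered[cut]
    upper = ordered[count - 1 - cut]
    return lower, upper

# Hypothetical replications: the integers 1..100 stand in for bootstrap error estimates.
replications = list(range(1, 101))
lower, upper = percentile_interval(replications, alpha=0.05)
print(lower, upper)
```

With 100 replications and alpha = 0.05, five values are trimmed from each tail, so the interval runs from the 6th to the 95th ordered value.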


Neural  network  architecture. 

A  fully  connected  feed-forward  backpropagation  neural  network  was  integrated  into  the 
NNB  system  for  this  study  in  the  manner  described  in  the  Experiment  Design  section.  Eight 
variables,  available  preoperatively  on  both  the  patients  and  the  implanted  devices,  were  used  as 
the network's inputs. Since the neural net had to be able to train on small samples, the cost
function $C(A)$, characterizing the efficiency of the net, had to be modified into $C'(A)$, where

$$C(A) = \sum_{k} \{ N(X_k) - D(X_k) \}^2 \qquad (10)$$

$$C'(A) = C(A) + \delta^2 \|A\|^2, \qquad (11)$$

A is the vector of the neural net's parameters, $X_k$ is the k-th vector of input parameters, $N(X_k)$ is
the output of the neural net, and $D(X_k)$ is the expected output of the net.
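The modified cost in (11) adds a penalty on the squared length of the parameter vector to the squared error in (10), which discourages large parameters when the training sample is small. A minimal Python sketch follows; the function name, the argument `penalty` (standing for the coefficient that multiplies the squared norm), and the sample numbers are assumptions for illustration.

```python
def regularized_cost(outputs, targets, params, penalty):
    """Squared error as in (10) plus a penalty on the squared parameter norm as in (11)."""
    squared_error = sum((n - d) ** 2 for n, d in zip(outputs, targets))
    squared_norm = sum(a * a for a in params)
    return squared_error + penalty * squared_norm

# Hypothetical network outputs, expected outputs, and parameter vector.
cost = regularized_cost([1.0, 0.0], [0.0, 0.0], [3.0, 4.0], penalty=0.5)
print(cost)  # 1.0 from the squared error plus 0.5 * 25.0 from the penalty
```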

Experiment  Design. 

The  neural  net  ~  bootstrap  methodology  was  tested  on  clinical  data  obtained  from  a 
clinical  trial  of  patients  implanted  with  artificial  heart  valves.  Seven  hundred  eighty-nine  (789) 
patients  implanted  with  Carpentier-Edwards  ®  Pericardial  bioprosthesis  have  been  followed 
from  1981.  The  aim  of  this  study  was  to  predict  which  patients  would  develop  device-related 
complications,  serious  enough  to  cause  death,  in  the  time  interval  from  1981  to  1991,  based  only 
on  preoperative  patients'  information  and  the  implanted  devices'  characteristics. 

Patient records were divided into two groups: $\Omega_1$, containing patients' records indicating
a "positive" response, i.e., death from a valve related complication between 1981 and 1991; and
$\Omega_2$, containing records with "negative" responses, i.e., no valve-related death. A predetermined
number (30) of "positive" responses were randomly selected with replacement from $\Omega_1$ and
assigned to the training set. An equal number of "negative" responses were randomly selected
from $\Omega_2$ and added to the training set. From the remaining records, both "positive" and
"negative", random sampling with replacement was used again for selecting testing records.
Sampling of "negative" and "positive" responses for the training set was performed
independently, in order to compensate for the relatively small proportion of the "positive"
responses (60 out of 776). The neural net was synthesized using data from the training set. The
data from the testing data-set was then used for estimation of the neural net efficacy, errors of
prediction and optimization of the neural net. Then training and testing sets were reselected, the
network retrained, and the prediction error estimated. A single NNB cycle included a bootstrap
generation of the training and testing sets, synthesis of the NN, testing NN and estimation of the
prediction accuracy. After repeating the process for 1300 cycles, the calculated errors were
accumulated and analyzed. Since the NNB system included the neural network, which had to be
synthesized, trained and tested hundreds of times, the process, being very computationally
intensive, called for a very efficient adaptive neural network. The neural net for the NNB system
was designed and developed by MultiSpectrum Technologies, Inc. of Santa Monica, California.
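The resampling step of one NNB cycle can be sketched as follows. This is only an illustration in Python: record identifiers stand in for patient records, the function name and the testing-set size are hypothetical (the text does not state the size), and sampling the "negative" records with replacement is an assumption.

```python
import random

def draw_training_and_testing(positives, negatives, per_class=30, test_size=100, seed=0):
    """Draw a balanced training set, then a testing set from the records not used for training."""
    rng = random.Random(seed)
    train_pos = rng.choices(positives, k=per_class)   # sampled with replacement
    train_neg = rng.choices(negatives, k=per_class)   # assumption: also with replacement
    used = set(train_pos) | set(train_neg)
    remaining = [r for r in positives + negatives if r not in used]
    testing = rng.choices(remaining, k=test_size)     # with replacement, from the remainder
    return train_pos + train_neg, testing

# Hypothetical record identifiers: 60 "positive" and 716 "negative" records.
positives = list(range(60))
negatives = list(range(60, 776))
training, testing = draw_training_and_testing(positives, negatives)
print(len(training), len(testing))
```

Repeating this draw with a different seed for each cycle, then training and testing the network on the two sets, gives one bootstrap replication of the prediction error per cycle.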

Classification  methodology. 

When  the  trained  network  predicted  a  valve  related  death,  where  no  such  event  was 
indicated  by  the  clinical  records,  we  called  it  "false  alarm".  On  the  other  hand,  if  a  patient  died 
from  a  valve  related  complication  and  the  event  was  not  predicted  by  the  network,  we  called  it 
"missed  event".  Two  separate  errors  of  prediction  were  computed  for  each  cycle  of  NNB.  The 
errors  were  calculated  as  ratios  of  incorrectly  predicted  events  (either  "false  alarms"  or  "missed 
events")  to  the  expected  number  of  events  of  the  same  type,  as  indicated  by  the  clinical  records. 
These  two  errors  were  then  averaged  to  obtain  a  single  score  representing  the  prediction  accuracy 
for  a  given  NNB  cycle. 

A step-function $S(h, z)$, taking values 0 and 1, was used for evaluating the networks'
output. It transformed the networks' continuous output into a dichotomous variable: predicted
event or predicted no-event.

$$S(h, z) = 1, \quad z > h \qquad (12)$$

$$S(h, z) = 0, \quad z < h \qquad (13)$$

Then correct predictions are defined as

$$S(h, z) = 1 \ \text{for} \ x \in \Omega_1 \quad \text{or} \quad S(h, z) = 0 \ \text{for} \ x \in \Omega_2 \qquad (14)$$

False alarms, according to our definition, are then described as

$$S(h, z) = 1 \ \text{for} \ x \in \Omega_2 \qquad (15)$$

and missed events

$$S(h, z) = 0 \ \text{for} \ x \in \Omega_1 \qquad (16)$$

The threshold h of the step-function is one of the networks' parameters, and by modifying its
value, the error rates may be adjusted.
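The scoring described in this section can be sketched in a few lines of Python. The function names and sample data are hypothetical, and two details are assumptions: an output exactly equal to the threshold is treated as "no event", and each error rate is taken relative to the number of records of the corresponding type (non-events for false alarms, events for missed events).

```python
def step(h, z):
    """Threshold the continuous network output z at h."""
    return 1 if z > h else 0

def error_rates(outputs, labels, h):
    """Return (false-alarm rate, missed-event rate, their average) for threshold h.

    labels: 1 for a recorded event ("positive"), 0 for no event ("negative").
    """
    pairs = list(zip(outputs, labels))
    false_alarms = sum(1 for z, y in pairs if y == 0 and step(h, z) == 1)
    missed_events = sum(1 for z, y in pairs if y == 1 and step(h, z) == 0)
    false_alarm_rate = false_alarms / labels.count(0)
    missed_event_rate = missed_events / labels.count(1)
    return false_alarm_rate, missed_event_rate, (false_alarm_rate + missed_event_rate) / 2

# Hypothetical network outputs and recorded outcomes.
outputs = [0.9, 0.2, 0.6, 0.1]
labels = [1, 1, 0, 0]
rates = error_rates(outputs, labels, h=0.44)
print(rates)
```

Sweeping `h` over a grid and recomputing the averaged error is one way to locate the threshold at which the averaged error is smallest.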


Results. 

In  Figure  1  the  rates  of  "false  alarms"  and  "missed  events"  are  plotted  against  the  step- 
function's  threshold  h.  As  the  value  of  the  threshold  increases,  the  number  of  "missed  events" 
increases  as  well,  while  the  number  of  false  alarms  goes  down.  In  Figure  3  the  relationship 
between  the  averaged  error  of  prediction  and  the  step-function's  threshold  is  investigated.  In  this 
case  the  averaged  error  of  prediction  attained  its  minimum  when  h  =  0.44,  and  the  value  of  the 
error  is  below  0.23,  which  gives  us  approximately  78%  accuracy  in  prediction.  Figure  2  presents 
a histogram of the averaged errors, obtained from 1300 NNB cycles for a fixed value of the
step-function's threshold h = 0.44. Based on the values of the bootstrap replications of the
NNB-generated averaged error, parameters of the error distribution were estimated and both the
prediction interval for the NN error and confidence interval around its mean were constructed
[2,7,8,10]. The mean of the averaged error was estimated at 0.22 (SE mean = 0.0015) and the
Conclusion.

Neural  networks  and  bootstrap  were  combined  into  one  algorithm  to  bring  the  best  of  the 
two  worlds  together:  a  powerful  predictive  engine  of  the  neural  nets'  paradigm  and  the  proven 
ways  of  statistical  methods  for  assessing  prediction  accuracy.  This  combination  of  the  two 
methods  also  proved  to  be  effective  for  further  optimization  of  the  neural  network  and  increasing 
its  predictive  power.  Based  on  the  preoperative  information  only,  78%  of  patients  were  correctly 
classified  by  the  neural  network  into  two  groups;  those  that  would  die  from  an  implanted  device 
related  cause  and  those  that  would  not.  The  averaged  error  of  prediction  was  22%  (+/-5%)  and 
the  prediction  interval  for  the  prediction  error  was  estimated  to  be  (14%,  32%)  at  the  95% 
confidence  level. 
Abstract:

It is commonly accepted that the modification of the weights during training of an Artificial Neural
Network can be augmented by addition of a random element chosen from various distributions. This technique,
commonly referred to as Noise Injection, allows the training process to stochastically traverse a larger area of the
sample space, as well as escape from local minima. Numerous noise distributions and intensities are used and
training can be shown, experimentally, to be more reliable.

The original paper examined the effect of noise injection on the training cycle of feed-forward neural
networks. Emphasis was placed on the gradient descent weight modification technique of the Backpropagation
model. Statistical examination was made of the distribution of the effect within the topology of the weight space,
upon the outputs of individual units, and on the total error of the network. Since the weights of the network can
be considered together as an n-tuple, injection of noise can be statistically examined within that n-dimensional
space. It was shown that, for stochastically independent random distributions, the effect on this weight space is
dependent upon the number of weights in the network. The multivariate distribution of the vector modification
during training becomes increasingly distorted as the network size increases, such that noise injection has a more
significant, and less stable, effect as network size increases. For each individual nodal output it can be shown that
the random distribution of change in the output of the node is a function of the multivariate distribution of all
weights preceding the neuron, the applied pattern, and the neuron activation function. This distribution was also
shown to be dependent upon network size.

The effect on error output of the network is a composite function of the effect on each output unit. This
distribution was examined in detail. The paper looked at the common independent uniform and normal noise
distribution injection in detail. Problems with these traditional approaches were examined and an alternative noise
injection method based on an n-tuple vector modification was proposed that was less dependent on network size.
The study found that uniformly distributed noise led to faster but less reliable convergence than normally
distributed noise. Additionally, generalization of the sine wave provided (overall) better approximation when the
noise was normally distributed than when the noise was uniformly distributed.
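The dependence on network size can be seen directly from the length of the noise vector: when every weight receives independent noise of the same spread, the expected length of the whole perturbation grows roughly with the square root of the number of weights. The Python sketch below illustrates this, and contrasts it with a perturbation drawn as a single vector of fixed length. The fixed-length scheme is only an illustrative stand-in, since the abstract does not spell out the proposed n-tuple method; all names and numbers are assumptions.

```python
import math
import random

def vector_length(v):
    return math.sqrt(sum(x * x for x in v))

def independent_noise(n, sigma, rng):
    """Independent normal noise added to each of n weights."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def fixed_length_noise(n, length, rng):
    """A random direction in n dimensions scaled to a prescribed length (illustrative only)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = length / vector_length(direction)
    return [scale * d for d in direction]

def mean_noise_length(n, sigma, trials=20, seed=0):
    rng = random.Random(seed)
    return sum(vector_length(independent_noise(n, sigma, rng)) for _ in range(trials)) / trials

small = mean_noise_length(10, 0.1)      # network with 10 weights
large = mean_noise_length(1000, 0.1)    # network with 1000 weights
fixed = vector_length(fixed_length_noise(1000, 0.1, random.Random(0)))
print(small, large, fixed)
```

The independent scheme perturbs the larger network far more strongly for the same per-weight spread, while the fixed-length vector has the same size regardless of the number of weights.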
Abstract

In  the  present  paper,  we  propose  a  minimum  information  principle  for  the  improvement  of 
the  generalization  performance.  This  principle  states  that  the  information  about  input  patterns 
must  be  as  small  as  possible  for  improving  the  generalization  performance  under  the  condition 
that  the  network  can  produce  targets  with  appropriate  accuracy.  The  information  is  defined 
by  the  difference  between  maximum  entropy  or  uncertainty  and  observed  entropy.  Borrowing  a 
definition of fuzzy entropy, the uncertainty function is defined for the internal representation and
represented by the equation: $-v_i \log v_i - (1 - v_i)\log(1 - v_i)$, where $v_i$ is a hidden unit activity.
After  having  formulated  an  update  rule  for  the  minimization  of  the  information,  we  applied  the 
method  to  a  problem  of  language  acquisition:  the  inference  of  the  past  tense  forms  of  regular 
verbs.  Experimental  results  confirmed  that  by  our  method,  the  information  was  significantly 
decreased  and  the  generalization  performance  was  greatly  improved. 


1  Introduction 

Many  techniques  have  been  developed  for  the  improvement  of  the  generalization  performance.  One  of 
the  most  popular  methods  consists  in  the  reduction  of  complexity  in  network  architectures,  for  example, 
addition  of  weight  decay  or  weight  pruning,  [4],  [6],  [7].  If  network  architectures  are  too  complex,  they  can 
store  everything  including  noises  in  addition  to  the  necessary  part  of  input  patterns.  If  the  complexity  is 
too  small,  it  is  impossible  to  learn  input  patterns.  Thus,  it  is  necessary  to  adjust  the  complexity  of  network 
architectures to given problems.

Let  us  see  the  complexity  problem  from  an  informational  theoretical  point  of  view.  Suppose  that  the 
complexity  represents  a  kind  of  information  capacity  of  networks.  If  the  information  regarding  training 
input patterns is excessively stored, meaning that networks store every detail of training patterns, the
generalization  performance  is  not  improved.  Thus,  the  reduction  of  complexity  shows  that  the  information 
about  training  patterns  is  appropriately  stored  in  network  architectures  so  as  to  improve  the  generalization 
performance. 

In  the  present  paper,  we  would  like  to  show  that  for  the  improvement  of  the  generalization  performance, 
the  information  to  be  stored  in  network  architectures  should  be  as  small  as  possible,  under  the  condition 
that networks can learn the training input patterns with appropriate accuracy. This statement is referred
to  as  Minimum  Information  Principle  for  the  improvement  of  the  generalization  performance. 

For  demonstrating  this  hypothesis  of  the  minimum  information  principle,  let  us  define  the  information, 
stored  in  network  architecture.  Information  can  be  defined  by  the  difference  between  maximum  entropy  and 
observed  entropy: 

$$I = H^{max} - H \qquad (1)$$

where $H^{max}$ is a maximum entropy and $H$ is an observed entropy. The information means how much
uncertainty is decreased by training networks with training patterns. Using this information, our objective
is to show that the information must be as small as possible for the improvement of the generalization
performance, under the condition that networks can produce targets with appropriate accuracy.

Figure 1: Entropy as a function of hidden unit activity $v_i$.

2  Theory  and  Computational  Methods 

We have defined an entropy for the internal representation. Following Bridle and Deco [1], [2], we suppose
that an activity of a hidden unit represents a probability that a given input pattern belongs to a certain
class. If $v_i$ represents an activity of the i-th hidden unit, this activity means the probability that a given input
pattern belongs to the class i. Suppose that the ordinary sigmoid activation function (0,1) is used to produce
outputs. The most uncertain state is a state in which the hidden unit produces an activity close to 0.5. In
this case, it is impossible to tell whether the input pattern belongs to class i or not. On the other hand,
the most certain state is a state in which the hidden unit produces an activity close to 1 (the input pattern
certainly belongs to the class) or 0 (the input pattern does not belong to the class). If $H_i$ represents this
uncertainty, one of the possible candidates is formulated as follows:

$$H_i = -v_i \log v_i - (1 - v_i) \log(1 - v_i).$$

This equation is equivalent to the well-known fuzzy entropy [3]. As can be seen from Figure 1, the function
$H_i$ reaches the maximum when the activity $v_i$ is 0.5, the most uncertain state.

With this entropy function, the information is defined by

$$I_i = H_i^{max} - H_i.$$

This information means the information content stored in a hidden unit for an input pattern. Using this
definition of information, our objective is to minimize this information as much as possible, under the
condition that networks can produce targets with appropriate accuracy.

Our entropy function is defined with respect to a hidden unit activity and shows uncertainty or ambiguity
of the function of the hidden unit. Our learning rule is to maximize this uncertainty or ambiguity as much
as possible.
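The entropy and information of a single hidden unit can be checked numerically. The following Python sketch uses natural logarithms, so the maximum entropy is log 2; the function names are illustrative, and returning zero entropy at exactly 0 or 1 is an assumption that follows the usual limiting convention.

```python
import math

def hidden_unit_entropy(v):
    """Fuzzy entropy -v log v - (1 - v) log(1 - v) of a hidden unit activity v in [0, 1]."""
    if v <= 0.0 or v >= 1.0:
        return 0.0  # limiting value: a fully certain unit carries no uncertainty
    return -v * math.log(v) - (1.0 - v) * math.log(1.0 - v)

def hidden_unit_information(v):
    """Maximum entropy (log 2, reached at v = 0.5) minus the observed entropy."""
    return math.log(2.0) - hidden_unit_entropy(v)

for activity in (0.5, 0.6, 0.9, 0.99):
    print(activity, hidden_unit_entropy(activity), hidden_unit_information(activity))
```

The information is zero at an activity of 0.5 and grows as the activity moves toward 0 or 1, so minimizing information pushes hidden activities toward the uncertain middle of the sigmoid range.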
Let us formulate an entropy function for the internal representation. Suppose that a network is composed
of three layers: input, hidden and output layers. Hidden unit activities are denoted by $v_i$ and input terminals
by $\xi_j$. Then, input-hidden connections are denoted by $w_{ij}$.

A hidden unit for the k-th input pattern produces an output

$$v_i^k = f(u_i^k),$$

where $f$ is a sigmoid activation function, defined by

$$f(u) = \frac{1}{1 + e^{-u}},$$

and where $u_i^k$ is a net input to the i-th hidden unit and defined by

$$u_i^k = \sum_{j} w_{ij} \xi_j^k,$$

where $\xi_j^k$ is the j-th element of the k-th input pattern. An entropy is defined by

$$H = -\sum_{k}^{K} \sum_{i}^{M} \left[ v_i^k \log v_i^k + (1 - v_i^k) \log(1 - v_i^k) \right] \qquad (2)$$


where K is the number of input patterns and M is the number of hidden units. Using this entropy, the
information content is defined by

$$I = H^{max} - H = KM \log 2 + \sum_{k}^{K} \sum_{i}^{M} \left[ v_i^k \log v_i^k + (1 - v_i^k) \log(1 - v_i^k) \right] \qquad (3)$$


where M is the number of hidden units. Now, suppose that the squared error function can be defined by

$$E = \frac{1}{2} \sum_{k} \sum_{i} (\zeta_i^k - O_i^k)^2,$$

where $\zeta_i^k$ is a target for an output $O_i^k$. Using this error function, the total function to be minimized is
formulated as follows:

where E is the squared error function. Differentiating both sides of this equation, we have the following
update rule:


Awij 


dF 

dwij 

k 


where 

=  t,f(l-vf)log(i^) 
vf 

and  j  is  an  ordinary  delta  for  the  back-propagation. 


(4) 
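
The information term and its contribution to the weight update can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes logistic hidden units without bias, natural logarithms, and hidden activities stored as a (K, M) array. The function names `information`, `phi` and `information_gradient` are hypothetical.

```python
import numpy as np

def information(v):
    """Information content I = K*M*log(2) - H for hidden activities v of shape (K, M)."""
    v = np.clip(v, 1e-12, 1 - 1e-12)  # keep the logarithms finite
    entropy = -np.sum(v * np.log(v) + (1 - v) * np.log(1 - v))
    return v.size * np.log(2) - entropy

def phi(v):
    """Per-unit, per-pattern term v(1 - v) log((1 - v) / v) of the update rule."""
    v = np.clip(v, 1e-12, 1 - 1e-12)
    return v * (1 - v) * np.log((1 - v) / v)

def information_gradient(weights, patterns):
    """Return -dI/dw_ij = sum_k phi_i^k * xi_j^k for logistic hidden units.

    weights: (M, n_inputs) input-hidden weights; patterns: (K, n_inputs) inputs.
    """
    v = 1.0 / (1.0 + np.exp(-patterns @ weights.T))  # hidden activities, shape (K, M)
    return phi(v).T @ patterns                       # shape (M, n_inputs)
```

When every hidden activity equals 0.5 the entropy is maximal and `information` returns zero, which is the minimum information state described in the text.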


3  Results  and  Discussion 

3.1  Data  and  Network  Architectures 

We applied our information minimization method to the well-known problem of language acquisition [5].
This problem is quite significant from a linguistic point of view, and it is easy to compare our results with the
previous results on past tense acquisition. Details of the training and testing patterns are omitted for
simplicity of explanation. We attempt only to show how networks were trained to produce targets.

Figure 2: Information as a function of the number of epochs for
five different values of the parameter α.

In this problem, networks are trained to produce correct past tense forms. For example, a string /pat/ is
given to the networks. From the grammar of our artificial language, the correct past tense form is /patid/.
Thus, networks must produce this correct past tense form after learning is finished.

In the actual experiments, input strings were given in the phonological representation [5], and the number
of input units was 18. The number of hidden units was 20. The number of output units was 20 for the
inference of regular verbs. The number of training patterns was 100 and the number of testing patterns was
500. Networks started to learn with initial random values in the range (-0.25, 0.25). The parameter for the
momentum term was fixed at 0.9 for all the experiments. Learning was performed in batch mode,
meaning that weights were updated after processing all the input patterns. Learning was
considered finished when the number of epochs reached 200.

3.2  Inference  of  Regular  Verbs 

In this section, we attempt to show that by increasing the value of the parameter α, the information stored
in the internal representation is decreased and the generalization performance is significantly improved.

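Figure 2 shows the information as a function of the number of epochs, when the number of hidden units
was 20. The information was normalized by the following equation:

$$I^{norm} = \frac{I}{KM \log 2} = 1 + \frac{1}{KM \log 2} \sum_{k}^{K} \sum_{i}^{M} \left[ v_i^k \log v_i^k + (1 - v_i^k) \log (1 - v_i^k) \right]. \tag{5}$$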

Thus, the range of this normalized information is [0,1]. If the information is 1, the information is maximized.
On the other hand, if the information is zero, the information is minimized. As can be seen from the figure,
the information increases quickly in the first part of learning and then decreases gradually. As
the parameter α is increased, the information decreases more significantly and approaches zero, the minimum
information state. This shows that the information minimization method is quite effective in decreasing the
information content in the internal representation.

Now, let us see how the generalization errors can be improved by using the method of information
minimization. Figure 3 and Figure 4 show training and generalization errors, respectively, as a function of the
number of epochs. Training errors ($T^{norm}$) and generalization errors ($G^{norm}$) were computed by using the
Hamming distance between targets and outputs at the output units. For example, the generalization errors
($G^{norm}$) are normalized as follows:

$$G^{norm} = \frac{1}{SN} \sum_{k}^{S} \sum_{i}^{N} \left| \zeta_i^k - A(O_i^k) \right|,$$

where $O_i^k$ is the output at the ith output unit for the kth input pattern of the testing patterns, $\zeta_i^k$ is its target,
N is the number of output units, S is the number of testing patterns, and $A(x)$ is 1 for $x > 0.5$ and 0
for $x < 0.5$. Consider Figure 3 for the training errors. As can be seen from the figure, training errors
decrease gradually and finally reach zero both for standard back-propagation (α = 0) and for the information
minimization method. Little difference can be seen in the training errors. However, for the testing data, a
big difference can be seen. Figure 4 shows generalization errors for the testing data. As the parameter α
is increased, the information is decreased significantly. We can clearly see that the generalization is much
improved by using the information minimization.
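
The normalized Hamming error described above can be sketched as follows. This is an illustration under assumptions: targets are taken to be binary (0 or 1), outputs are thresholded at 0.5 as the function A prescribes, and the name `normalized_hamming_error` is hypothetical.

```python
import numpy as np

def normalized_hamming_error(outputs, targets, threshold=0.5):
    """Fraction of output units whose thresholded value differs from the binary target.

    outputs, targets: arrays of shape (S, N) for S patterns and N output units.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    binary = (outputs > threshold).astype(float)  # A(x): 1 above the threshold, else 0
    return float(np.mean(np.abs(targets - binary)))
```

For two patterns with two output units each, outputs `[[0.9, 0.2], [0.4, 0.6]]` and targets `[[1, 0], [1, 1]]` disagree on one of four units, so the error is 0.25.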

4  Conclusion 

In the present paper, we have proposed the minimum information principle. This principle states that for
good generalization performance, the information stored in a network architecture must be as small as possible,
under the condition that networks have sufficient capacity to learn the input patterns with appropriate
accuracy. We have formulated the entropy function for the internal representation and the update rule to
minimize the information content. By applying the method to the problem of language acquisition, we have
demonstrated that generalization is indeed improved by minimizing the information content stored in the
internal representation. We think that many techniques for improving generalization
performance can be incorporated into the framework of the minimum information principle.
Abstract

We propose a method that scales data to a range that is appropriate for presentation to a neural
network, and takes into consideration its actual probability distribution. This method can be
applied to any data set, even when there is no prior knowledge of its underlying distribution. We
have found that such data transformation greatly improves learning in a standard
backpropagation network, and allows the network to learn difficult, linearly non-separable
problems that resist more traditional methods.

INTRODUCTION

When training a feed-forward, multilayer neural network there are several issues that arise, with
data representation being one of major importance. To represent real numbers we must consider
how to scale and, if necessary, transform the data. Scaling is certainly called for in networks
whose outputs are other than binary, and, in general, learning is probably improved by limiting
the range the learning algorithm must traverse. Transformations, on the other hand, should be
considered whenever the variables have a highly asymmetrical distribution, or greatly uneven
variances. Linear scaling under these circumstances could lead to loss of information, as the data
are unevenly compressed or expanded. A common approach consists of performing a
preliminary exploratory analysis of the data, and then applying a suitable transformation.

We propose a method that scales the data to any range appropriate for presentation to a neural
network, while at the same time letting us cluster it in a way that can make it more meaningful.
We have found that such data transformation improves learning in a standard backpropagation
network, allowing the network to learn difficult, non-linearly separable problems that resist more
traditional methods.

TRANSFORMATION  ALGORITHM 

The purpose of the transformation is to create a one-to-one correspondence between the actual
data and its transformed (scaled and clustered) values, based upon the data's probability
distribution. The proposed algorithm can be used on any bounded, real valued input vectors
with a finite number of elements.

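Let $X = (x_{\min}, \ldots, x_{\max})$ be a vector of the actual data, arranged in ascending order, with $x_{\min}$
and $x_{\max}$ its minimum and maximum values, respectively. We now divide this range into N equal
segments, of length $X_k$, where

$$X_k = \frac{x_{\max} - x_{\min}}{N}, \quad k = 1, 2, \ldots, N \qquad \text{(Eq. 1)}$$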

The normalization function transforms all segments of the actual data, $X_k$, to the corresponding
clustered segments of normalized data, $\lambda_k$, according to the expression:

(Eq. 2)

where:

$$\lambda_k = P_k \cdot C \cdot L \qquad \text{(Eq. 3)}$$

$$D_k = \frac{1 - C}{N - 1}, \quad k = 1, 2, \ldots, N-1, \qquad D_N = 0 \qquad \text{(Eq. 4)}$$

where $P_k$ is the probability that an actual data value, which belongs to the input vector X, is
located in the segment $X_k$, L is the length of the normalized input vector, and C is a compression
factor. Although the data can be normalized between any real numbers, for the purposes of this
work the members of the input vector will be normalized between 0 and 1 and, therefore, L = 1.
The number of clusters is defined by N, and the value of the compression factor, C, defines the
lengths of the normalized segments $\lambda_k$ and the spaces $D_k$ between them [Eqs. (3) and (4)], with

$$0 < C < 1$$

Since $\sum_{k=1}^{N} P_k = 1$, the sum of all transformed segments $\lambda_k$ equals

$$\sum_{k=1}^{N} \lambda_k = C \cdot L \cdot \sum_{k=1}^{N} P_k = C \cdot L \qquad \text{(Eq. 5)}$$

All members $(x_1^k, \ldots, x_n^k)$ inside the segment $X_k$ of the input vector X are homothetically
transformed to the members $(\chi_1^k, \ldots, \chi_n^k)$ of the normalized segment $\lambda_k$ in proportion to
their distances from the origins of the segments $X_k$ and $\lambda_k$, respectively, with a coefficient
of similitude equal to $\lambda_k / X_k$. Hence, if $x_n \in X_k$ then its transformed value $\chi_n \in \lambda_k$ and:

$$\frac{\chi_n - \chi_{k-1}}{x_n - x_{k-1}} = \frac{\lambda_k}{X_k} \qquad \text{(Eq. 6)}$$

where $\chi_{k-1}$ and $x_{k-1}$ are the values of the origins of the normalized segment $\lambda_k$ and actual data segment
$X_k$, respectively.

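From equation (6), and using equations (1-5), we derive a one-to-one correspondence between
any actual value $x_n$ belonging to the segment $X_k$, and its transformed value $\chi_n$ belonging to the
segment $\lambda_k$:

$$\chi_n = \chi_{k-1} + \frac{N \cdot P_k \cdot C \cdot L \cdot (x_n - x_{k-1})}{x_{\max} - x_{\min}} \qquad \text{(Eq. 7)}$$

where:

$$\chi_{k-1} = C \cdot L \cdot \sum_{j=1}^{k-1} P_j + \frac{(1 - C)(k - 1)}{N - 1} \qquad \text{(Eq. 8)}$$

and:

$$x_{k-1} = x_{\min} + \frac{(k - 1)(x_{\max} - x_{\min})}{N} \qquad \text{(Eq. 9)}$$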
$\chi_n$ - the transformed value, corresponding to the real value $x_n$, which belongs to the k-th
segment.

$P_j$, $P_k$ - the probabilities of distribution of the variable x on the j-th and k-th segments, respectively.

N - the number of segments that the variable has been divided into.

$x_{\min}$, $x_{\max}$ - minimum and maximum values of the variable, respectively.

$x_{k-1}$ - the coordinate of the origin of the k-th segment.

L - the length of the normalized input vector.

C - a compression factor.
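
The transformation can be sketched in a few lines of NumPy. This is an illustrative reading of the algorithm, not the authors' code. It assumes that each normalized segment has width P_k·C·L, that adjacent normalized segments are separated by a gap of (1 − C)/(N − 1), and that P_k is estimated as the fraction of data values falling in segment k. The function name `probability_scale` and its parameter names are hypothetical.

```python
import numpy as np

def probability_scale(x, n_segments=4, c=0.9, length=1.0):
    """Piecewise-linear scaling in which segment k receives width P_k * C * L."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if n_segments < 2 or x_max == x_min:
        raise ValueError("need at least two segments and non-constant data")
    seg_len = (x_max - x_min) / n_segments            # equal-length raw segments
    # Zero-based segment index; the maximum value is placed in the last segment.
    k = np.minimum(((x - x_min) / seg_len).astype(int), n_segments - 1)
    p = np.bincount(k, minlength=n_segments) / x.size  # empirical P_k (assumption)
    lam = p * c * length                               # normalized segment widths
    gap = (1.0 - c) / (n_segments - 1)                 # data-free space between segments
    # Origin of each normalized segment: preceding widths plus preceding gaps.
    origin_norm = np.concatenate(([0.0], np.cumsum(lam)[:-1])) + gap * np.arange(n_segments)
    origin_raw = x_min + seg_len * np.arange(n_segments)
    return origin_norm[k] + lam[k] * (x - origin_raw[k]) / seg_len
```

With L = 1 the output spans [0, 1]. For example, the values 0, 1, 2, 3, 10 with two segments and C = 0.9 give P = (0.8, 0.2), so the four small values are spread over [0, 0.72] while the outlier maps to 1.0.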

As shown by equation (3), the length of the normalized data segments depends linearly on the
probability that data are observed at the corresponding actual data segments; therefore, the
normalization is carried out without loss of "non-significant" data values, which could prove
important in real-life problems. Moreover, this is accomplished without the need for prior
knowledge regarding the underlying distribution.

The number of segments is determined by N, which can take any integer value > 1, and it is
arrived at either empirically or through the use of more formal methods, such as cluster analysis.
The compression factor, C, adds flexibility to the representation because it determines the lengths
of the normalized segments and of the spaces, free of data, between them. Hence, it allows
presentation of data to the neural network as either distinctly quantized data sets, C ≈ 0, or as
individual data members, C ≈ 1. By tailoring the values of N and C to the individual variables,
one can adapt the representation to any given data set.

Another  point  that  bears  emphasis  is  that  this  transformation  is  independent  from  the  internal 
network  structure,  making  it  possible  to  present  data  so  transformed  to  any  neural  network, 
regardless  of  the  architecture. 

Experimental  Methods. 

To test the effect of the transformation algorithm on network performance, we trained a series of
multilayer backpropagation networks to predict outcome following liver transplantation. The
data set used was gathered from 155 liver transplantations, and has been described in detail
[Doyle, et al., 1994]. Initially, ten different training/testing data sets were prepared by random
subsampling of the original data. The training sets consisted of 138 examples, while the testing
sets had 17 examples, with the proportions of both outcomes (i.e., success and failure) being the
same in both. By applying the preceding algorithm to these data, three separate transformed data
groups were generated (using 2, 4, and 10 intervals, respectively, and a compression factor of 0.9).
A fourth group consisted of data that were linearly scaled.

The networks used in these experiments had the same architecture, namely 19 input neurons, a
single hidden layer with two neurons, and one output neuron. There were a total of 40 networks
(one for each training/testing set in each of the four data groups), and each network was trained
10 times, using different initial random weights. The following parameters were compared:

•  The  number  of  networks  which  were  able  to  completely  learn  their  training  sets  in  the  course 
of  70,000  epochs 

•  Minimal  training  RMS  errors 

•  Mean  RMS  error,  for  all  networks  and  those  which  learned  their  training  sets  completely 

Results were compared using one-way analysis of variance (Scheffe test), with the level of
significance set at 0.05.
Results

Table 1 shows the summary results:

Table 1

Transformation   Complete   Min-RMS   Mean-RMS   Complete-RMS
Linear           39         0.031     0.079      0.045
2-intervals      26         0.024     0.097      0.042
4-intervals      54         0.020     0.074      0.032
10-intervals     75         0.020     0.052      0.028

Where: 

•  Complete  -  Number  of  networks  that  learned  their  training  set  completely  in  the  course  of 
70,000  epochs  (n=100). 

•  Min-RMS  -  The  minimal  training  RMS  error 

•  Mean-RMS  -  The  mean  of  training  RMS  errors 

•  Complete-RMS  -  Mean  of  training  RMS  errors  among  the  networks  that  learned  their  training 
set  completely  in  the  course  of  70,000  epochs 

The  performance  of  those  networks  working  with  data  transformed  using  10  intervals  was 
superior  to  that  of  those  using  linearly  transformed  data,  in  all  the  categories  examined  (one-way 
analysis  of  variance). 

Conclusion 

Our  results  suggest  that  employment  of  the  proposed  transformation  improves  learning  in 
backpropagation  neural  networks.  This  introduces  another  degree  of  freedom  when  developing 
a  neural  network  model,  which  is  independent  of  network  architecture,  and  offers  the  possibility 
of  representing  data  in  a  manner  that  more  closely  reflects  its  underlying  distribution. 

Since the number of possible proposed transformations is theoretically infinite, the obvious
question arises: how to choose the best one for a particular input variable? Although we are not
able to answer this question at present, we see our future work in this area exploring the following
issues:

•  Using well-established clustering algorithms (K-means, Melting, etc.) to choose the most
appropriate number of intervals

•  Studying the effects of different compression factors as we vary the number of intervals

•  Developing "toy" networks to determine, for different variables belonging to a given domain,
what is the best combination of compression factor and number of intervals.
ABSTRACT

Contribution analysis is a useful tool for the analysis of cross-connected networks such as those generated by the
cascade-correlation learning algorithm. Networks with cross connections that supersede hidden layers pose particular
difficulties for standard analyses of hidden unit activation patterns. A contribution is defined as the product of an
output weight and the associated activation on the sending unit. Previously such contributions have been multiplied
by the sign of the output target for a particular input pattern. The present work shows that a principal components
analysis (PCA) of unscaled contributions yields more interesting insights than comparable analyses of contributions
scaled by the sign of output targets.

1  INTRODUCTION 

Solutions learned by neural networks are often quite difficult to understand because of the complex non-linear
properties of neural nets and the common use of distributed representations. Standard techniques of network analysis,
based either on a network's weights or its hidden unit activations, have been somewhat limited. The most notable
features of weight diagrams are often the complexity of the pattern of weights and its variability across multiple
networks learning the same problem. Statistical analysis of activation patterns on hidden units is limited to nets
with a single hidden layer without cross-connections.

Cross connections are direct connections that bypass intervening hidden layers. They are known to increase learning
speed in back-propagation networks (Lang & Witbrock, 1988) and are a standard feature of some generative learning
algorithms, such as cascade-correlation (Fahlman & Lebiere, 1990). Because such cross connections carry so much of
the work load, any analysis restricted to hidden unit activations provides at best a partial picture of the network
solution.

Contribution analysis appears to be a useful technique for multi-layer, cross connected nets. Sanger (1989) defined a
contribution as the triple product of an output weight, the activation of a sending unit, and the sign of the output
target for that input. Contributions are potentially more informative than either weights alone or hidden unit
activations alone since they take account of both weight and sending activation. Shultz and Elman (1994) used
principal components analysis to reduce the dimensionality of such contributions in several different types of
cascade-correlation nets.

The present work explores whether it is preferable to employ contributions that are scaled by the sign of their output
targets or to use unscaled contributions in network analysis. Sanger (1989) recommended scaling contributions by
the signs of output targets in order to determine whether the contributions helped or hindered the network's solution.
However, since target signs are not available to networks except as error correction signals, it could be argued that it
is more natural to use unscaled contributions in analyzing knowledge representations.

Understanding the knowledge representations in network solutions may be useful in a variety of contexts. It is surely
useful in the area of cognitive modeling, where the mere ability of nets to simulate psychological phenomena does
not suffice. It is also critically important to determine whether the representations developed by networks bear any
systematic relation to the representations employed by human subjects (McCloskey, 1991).

2  PRINCIPAL  COMPONENTS  ANALYSIS  OF  CONTRIBUTIONS 

In contrast to Sanger's (1989) three-dimensional array of contributions (output unit x hidden unit x input pattern), we
begin with a two-dimensional output weight x input pattern array of contributions. This is more efficient than the
slicing technique used by Sanger to focus on particular output or hidden units and yet allows identification of the
roles of specific contributions (Shultz & Elman, 1994).

We subject the correlations among contributions across input patterns to PCA, a statistical technique that identifies
dimensions of variation (Flury, 1988). A component is a line of closest fit to a set of points in multi-dimensional
space. PCA summarizes a multivariate data set in a few components by capitalizing on correlations among the
variables.

Here we apply PCA to contributions taken from networks learning either continuous XOR or arithmetic
comparisons. The contribution matrix for each net is subjected to PCA with 1.0 as the minimum eigenvalue for
retention. Varimax rotation is applied to improve the interpretability of the solution. Component scores are plotted
to indicate the function of the components and component loadings are examined to determine the roles of particular
contributions.
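
The core of this procedure can be sketched with NumPy. The sketch below is an illustration under assumptions, not the original analysis code: it builds unscaled contributions as activation times output weight, runs PCA on their correlation matrix, and retains components whose eigenvalue is at least 1.0. Varimax rotation is omitted for brevity, constant contributions (for example from a bias unit) are dropped because their correlations are undefined, and the function name `contribution_pca` is hypothetical.

```python
import numpy as np

def contribution_pca(activations, weights, min_eigenvalue=1.0):
    """PCA of unscaled contributions (sending activation x output weight).

    activations: (patterns, sending units); weights: (sending units, output units).
    Returns retained eigenvalues, unrotated loadings, and component scores.
    """
    n_patterns = len(activations)
    # One column per output weight: contribution = activation * weight.
    contributions = (activations[:, :, None] * weights[None, :, :]).reshape(n_patterns, -1)
    varying = contributions.std(axis=0) > 0       # drop constant columns
    z = contributions[:, varying]
    z = (z - z.mean(axis=0)) / z.std(axis=0)      # standardize each contribution
    corr = (z.T @ z) / n_patterns                 # correlation matrix across patterns
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= min_eigenvalue              # retention criterion from the text
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    scores = z @ eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return eigvals[keep], loadings, scores
```

To obtain the scaled variant recommended by Sanger (1989), each row of the contribution matrix would additionally be multiplied by the sign of the output target for that pattern before the correlations are computed.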

3  APPLICATION  TO  THE  CONTINUOUS  XOR  PROBLEM 

The classical binary XOR problem has too few training patterns (four) to require contribution analysis. We construct
a continuous version of the XOR problem by dividing the input space into four quadrants. Input values are
incremented in steps of 0.1 starting from 0.1, yielding 100 x, y input pairs. Quadrant a has values of x less than
0.55 combined with values of y above 0.55. Quadrant b has values of x and y greater than 0.55. Quadrant c has
values of x and y less than 0.55. Quadrant d has values of x greater than 0.55 combined with values of y below
0.55. Problems from quadrants a and d produce a positive output target, whereas problems from quadrants b and c
yield a negative output target.
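
The training set described above can be generated as follows. This is a minimal sketch: the text specifies only the sign of each target, so the targets are returned as +1 and -1, which is an assumption, and the function name `continuous_xor` is hypothetical.

```python
import numpy as np

def continuous_xor(threshold=0.55):
    """Return the 100 (x, y) inputs, their quadrant labels, and target signs."""
    values = np.round(np.arange(1, 11) * 0.1, 1)   # 0.1, 0.2, ..., 1.0
    x, y = np.meshgrid(values, values, indexing="ij")
    x, y = x.ravel(), y.ravel()
    quadrant = np.where(
        x < threshold,
        np.where(y > threshold, "a", "c"),          # low x: a (high y) or c (low y)
        np.where(y > threshold, "b", "d"),          # high x: b (high y) or d (low y)
    )
    sign = np.where(np.isin(quadrant, ["a", "d"]), 1, -1)  # a and d are positive
    return np.column_stack([x, y]), quadrant, sign
```

Because no input value equals 0.55, every pattern falls in exactly one quadrant, giving 25 patterns per quadrant.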

Three cascade-correlation nets are trained on continuous XOR. Each net generates a unique solution, recruiting either
five or six hidden units and taking from 541 to 765 epochs. PCA of unscaled contributions yields three components
rather than the two yielded by PCA of scaled contributions (Shultz & Elman, 1994). Plots of rotated component
scores for the 100 training patterns are less dense but more interesting for unscaled than for scaled contributions.

Two-dimensional plots of component scores for net 1 are shown in Figure 1 and labeled according to their respective
quadrant. Figure 1a, plotting scores on components 1 and 3, shows that component 1 reflects the second input
dimension (quadrants a and b vs. quadrants c and d). Figure 1b, plotting scores on components 2 and 3, shows that
component 2 reflects the first input dimension (quadrants b and d vs. quadrants a and c). Both Figures 1a and 1b
reveal that component 3 separates the quadrants with a positive output target (a and d) from those with a negative
output target (b and c). Similar results were obtained for the two other nets. In contrast, plots of component scores
for scaled contributions indicated interactive separation of the four quadrants, but with no clear individual roles for the
two components (Shultz & Elman, 1994).

Figure 2 plots the rotated component scores for this net. Such plots can be examined to determine the role of each
contribution in the network. For example, input 2 and hidden units 1, 5, and 6 all participate in the job done by
component 1, namely the representation of the second input dimension.

4  APPLICATION  TO  COMPARATIVE  ARITHMETIC 

Arithmetic comparison tasks require nets to compare sums or products to some value and then output whether the
sum or product is greater than, less than, or equal to that comparative value. The fact that several psychological
simulations using neural nets involve problems of linear and non-linear arithmetic operations enhances interest in
this sort of problem (McClelland, 1989; Shultz, Schmidt, Buckingham, & Mareschal, in press).

Addition and multiplication tasks each involve three linear input units. The first two input units each code a
randomly selected integer in the range from 0 to 9, inclusive, and the third input unit codes a randomly selected
comparison integer. For addition problems, comparison values range from 0 to 19, inclusive; for multiplication,
comparison values range from 0 to 82, inclusive. Two output units code the results of the comparison. Target
outputs of +- represent that the result of the arithmetic operation is greater than the comparison value, targets of -+
represent less than, and targets of -- represent equal to. For problems involving both addition and multiplication, a
fourth input unit codes the type of arithmetic operation to be performed: 0 for addition, 1 for multiplication.

Nets trained on either addition or multiplication have 100 randomly selected training patterns, with the restriction
that 45 of them have correct answers of greater than, 45 have correct answers of less than, and 10 have correct
answers of equal to. These constraints reduce the skew of comparative values in the high direction on multiplication
problems. Nets trained on both addition and multiplication receive 100 randomly selected addition problems and 100
randomly selected multiplication problems. There are three addition nets, three multiplication nets, and three nets
trained on both addition and multiplication.
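
One way to build such a constrained training set is rejection sampling. The sketch below is an assumption: the text states only that patterns are randomly selected subject to the 45/45/10 restriction, not how the selection was carried out. The function name `make_problems` and its parameters are hypothetical.

```python
import random

def make_problems(op, n_greater=45, n_less=45, n_equal=10, seed=0):
    """Sample (first, second, comparison, answer) tuples until each quota is filled.

    op is "add" or "multiply"; comparison values range over 0..19 or 0..82.
    """
    rng = random.Random(seed)
    max_comparison = 19 if op == "add" else 82
    quota = {"greater": n_greater, "less": n_less, "equal": n_equal}
    problems = []
    while any(quota.values()):
        a, b = rng.randint(0, 9), rng.randint(0, 9)       # randint is inclusive
        c = rng.randint(0, max_comparison)
        result = a + b if op == "add" else a * b
        answer = "greater" if result > c else "less" if result < c else "equal"
        if quota[answer] > 0:                              # reject once a quota is full
            quota[answer] -= 1
            problems.append((a, b, c, answer))
    return problems
```

A combined training set for nets doing both operations would concatenate `make_problems("add")` and `make_problems("multiply")`, adding the fourth input that codes the operation.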

4.1  ADDITION  RESULTS 
Each of the three nets learning addition problems recruited a single hidden unit. They took between 155 and 169
epochs to learn. PCA of unscaled contributions in each net yields three significant components, unlike the two
components obtained with scaled contributions.

Component score plots, such as that for net 1 in Figure 3, indicate that component 1 distinguishes less than from
greater than answers. Problems with equal to answers were not isolated by the three components. Components 2 and
3 are particularly sensitive to variation in the size of the first and second integers to be added, respectively. This was
revealed by examining extreme component scores on these components, either greater than 1.0 or less than -1.0.
Problems with extremely negative component 2 scores had a mean of 8.41 for the first integer and 5.36 for the
second integer. Problems with extremely positive component 2 scores had a mean of 1.00 for the first integer and
5.52 for the second integer. This indicates that component 2 is primarily sensitive to the size of the first integer
input. In contrast, component 3 was sensitive to the size of the second integer input, with means of 1.48 for
extremely negative component scores and 8.36 for extremely positive component scores. The means on the first
integer input did not vary much with extremity of component 3 score: 4.70 vs. 4.05. Similar results were obtained for the
other two nets.

PCA  of  scaled  contributions  had  produced  two  components  that  were  sensitive  only  to  answer  type  and  not  to 
variation  in  integer  input.  As  with  the  continuous  XOR  problem,  the  plots  of  component  scores  were  denser  for 
scaled  contributions,  but  not  as  revealing  (Shultz  &  Elman,  1994). 

4.2  MULTIPLICATION  RESULTS 

Multiplication is a much more difficult problem for nets with additive activation functions, as revealed by the fact
that the nets learning multiplication comparisons required from 832 to over 1000 epochs and recruited between six
and eight hidden units. Runs were terminated when they reached 1000 epochs. PCA applied to the contributions in
these nets yields from 4 to 6 significant components. Plots of rotated component scores for two of the four
components from net 3 are presented in Figure 4. This plot shows that most of the separation of greater than from
less than outputs was accomplished by component 2. Component 1 served to make this distinction for the remaining
problems. Problems with equal to answers were not isolated by any of the four components.

Component 1 also served to represent variation in the second input. Problems with extremely high scores on
component 1 have a mean second input of 8.57; those with extremely low scores on component 1 have a mean
second input of 0.56. Component 3 serves a similar role for the first input. Problems with extremely high scores on
component 3 have a mean first input of 8.11; those with extremely low scores on component 3 have a mean first
input of 1.10. The role of component 4 is opaque. Basically similar results were obtained for the other two
multiplication nets. In contrast, PCA of scaled contributions was less revealing, except for offering a clear separation
of answer types (Shultz & Elman, 1994).

4.3  RESULTS  FOR  NETS  DOING  BOTH  ADDITION  AND  MULTIPLICATION 

Learning to do both addition and multiplication is even more difficult than multiplication alone. None of the three
nets quite reached victory by 1000 epochs, but each did come close. Either seven or eight hidden units were
recruited. PCA of contributions yields five components in each of the three nets. Besides the familiar distinctions
between problem types and variation in integer inputs found in nets doing either addition or multiplication, it is of
interest to determine whether nets doing both operations distinguish between adding and multiplying.

Figure 5 shows rotated component scores for three components from net 3. Component 1 separates greater than from
less than answers. Component 5 and, to a lesser extent, component 4 separate adding from multiplying. The role of
component 4 is not very clear from Figure 5, but various two-dimensional plots of component 4 reveal that it
separates adding vs. multiplying for problems with less than answers.

Components 2 and 3 handle variation in the first and second input integers, respectively. Problems with extremely
positive component 2 scores have a mean first input integer of 8.53; problems with extremely negative component 2
scores have a mean first input integer of 0.84. Problems with extremely positive component 3 scores have a mean
second input integer of 8.55; problems with extremely negative component 3 scores have a mean second input
integer of 1.05. Problems with equal to answers are not isolated by any of the components. Results for the other two
nets learning both multiplication and addition comparisons are essentially similar to these. In contrast, PCA of
scaled contributions had produced three components that interactively separate the three answer types and operations,
but did not represent variation in input integers (Shultz & Elman, 1994).

5  DISCUSSION 
As with continuous XOR, there is considerable variation among networks learning comparative arithmetic problems.
Yet with all of this variation, it is apparent that the nets learn to separate arithmetic problems according to features
afforded by the training set. Nets learning either addition or multiplication differentiate the problems according to
answer types, and nets learning both arithmetic operations supplement these answer distinctions with the operational
distinction between adding and multiplying. Variation along the integer input dimensions is also well represented.

This research confirms earlier conclusions that PCA of network contributions is a useful technique for understanding
the performance of networks constructed by the cascade-correlation learning algorithm (Shultz & Elman, 1994).
Because cascade-correlation nets typically possess multiple hidden layers and are fully cross connected, they are
difficult to analyze with more standard methods emphasizing activation patterns on the hidden units alone.
Examination of their weight patterns is also problematic, particularly in larger networks, because of the highly
distributed nature of the net's representations.

Analyzing contributions, in contrast to either hidden unit activations or weights, is a naturally appealing solution.
Contributions capture the influence coming into output units both from adjacent hidden units and from distant, cross
connected hidden and input units.

The present work also suggests that analyzing unscaled contributions yields more useful results than does the
analysis of contributions that are scaled by the output targets. This is particularly true in terms of sensitivity to
various input dimensions and to operational distinctions between adding and multiplying. Plots of component scores
based on unscaled contributions are typically not as dense as those based on scaled contributions but seem to be more
revealing of what information the network is representing. Including target outputs in these analyses is not only
unrealistic, but also obscures at least part of what networks represent, such as variation along important input
dimensions. A drawback of using unscaled contributions is that contributions from the bias unit are ignored for lack
of variation. This may explain why the present analyses fail to isolate arithmetic problems with equal to outcomes.
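
The bias-unit drawback can be illustrated with a minimal sketch. It assumes, as is usual in contribution analysis but not spelled out in this section, that an unscaled contribution is the sending unit's activation times its weight into the output unit, and that the scaled version multiplies this by the output target. The weights, patterns, and targets below are made up purely for illustration.

```python
import statistics

# Hypothetical weights from two inputs and a bias unit into one output unit.
WEIGHTS = {"in1": 0.8, "in2": -1.5, "bias": 0.3}
PATTERNS = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.2)]
TARGETS = [-0.5, 0.5, -0.5, 0.5]  # made-up output targets


def contributions(x1, x2):
    """Unscaled contributions: sending activation times connection weight."""
    acts = {"in1": x1, "in2": x2, "bias": 1.0}  # the bias unit is always 1
    return {name: acts[name] * WEIGHTS[name] for name in WEIGHTS}


unscaled = [contributions(x1, x2) for x1, x2 in PATTERNS]
# Scaled contributions: each contribution multiplied by the pattern's target.
scaled = [{k: v * t for k, v in row.items()} for row, t in zip(unscaled, TARGETS)]


def variance_of(rows, name):
    """Population variance of one sending unit's contribution across patterns."""
    return statistics.pvariance([row[name] for row in rows])


print("bias variance, unscaled:", variance_of(unscaled, "bias"))
print("bias variance, scaled:  ", variance_of(scaled, "bias"))
```

Under these assumptions the unscaled bias contribution is the same for every pattern, so it has no variance for PCA to pick up, whereas scaling by the target reintroduces variation.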

Because PCA of contributions can identify the role of contributions from particular hidden units, it should be useful
in predicting the results of lesioning experiments with neural nets. Once the role of a hidden unit has been identified
by its association with a particular principal component, then it could be predicted that lesioning this unit would
impair whatever function is served by the component. PCA of network contributions obtained from cognitive
modeling could also be a useful source of psychological hypotheses.
Figure 2. Component loadings for a continuous XOR net
Abstract

A  learning  algorithm,  referred  to  as  concurrent  training,  based  on  genetic  algorithms  for 
a  neural  network  with  connected  modules  is  described.  The  algorithm  does  not  require  the 
knowledge  of  training  sets  for  each  module  so  that  all  modules  can  be  trained  concurrently. 
For an N-module system, N separate pools of chromosomes are maintained and updated.
The concurrent training algorithm is applied to train multilayered feedforward networks by
considering each layer of connections to be a 1-layer network module. The algorithm is tested
using the 4-bit parity problem and a linearly nonseparable classification problem. Experimental
results are presented and the learning behavior and performance are analyzed.


1  Introduction 

Supervised  learning  of  feedforward  networks  is  typically  in  the  form  of  a  search  in  the  weight 
space  for  a  set  of  weights  which  minimizes  the  difference  between  the  computed  and  the  target 
outputs  for  a  given  input.  The  optimization  method  used  most  often  in  this  role  is  the  gradient 
descent  search,  such  as  in  the  variations  of  the  backpropagation  learning  algorithm.  After 
the  network  has  completed  the  learning  phase,  faults  in  the  components  of  the  network  may 
lead  to  incorrect  output  being  computed.  In  this  paper,  the  design  of  fault-tolerant  feedforward 
networks  is  considered  by  incorporating  a  measure  of  fault-tolerance  in  the  optimization  criterion 
during  learning.  Since  the  optimization  function  will  not  be  convex,  the  use  of  an  alternative 
optimization  method  known  as  genetic  algorithms  to  train  multilayered  feedforward  networks 
in  the  supervised  learning  mode  is  described. 

Previous work on designing fault-tolerant neural networks has included the injection of faults
during training [1], min-max fault-tolerance learning [2], fault-tolerance through weight control,
and fault-tolerance through strict learning and strict operation [3]. The method used in the
present paper explicitly separates the classification error from the errors due to faults in the
optimization function. Let $\{(\xi^\mu, \tau^\mu) : \mu = 0, 1, \ldots, P-1\}$ be the training set, where $\xi^\mu$ and
$\tau^\mu$ denote the $\mu$th input and target output, respectively. Let $\zeta^\mu$ denote the output of a network in
response to the input pattern $\xi^\mu$. The mean-squared classification error

$$E_n = \sum_{\mu} |\zeta^\mu - \tau^\mu|^2$$

is typically the measure to be minimized in a training algorithm. Suppose the network has $M$
hidden units; let $\zeta_m^\mu$, $m = 0, \ldots, M-1$, be the network output in response to $\xi^\mu$ when the $m$th
hidden unit is faulty. The error due to faults is defined to be:

The  overall  optimization  criterion  in  our  training  algorithm  is  then: 

$$E = E_n + \lambda E_f,$$

where $\lambda$ is a scalar constant, set to 0.4 in our experiments. Note that this paper
restricts attention to faults occurring in the hidden units; nevertheless, the concepts can
be generalized to failures in the links or units in other layers by extending $E_f$ to cover all faults.
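
The combined criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extracted text does not preserve the exact definition of the fault error, so the sketch assumes that a faulty hidden unit outputs 0 and that the fault error sums the squared output error over every single-unit fault and every pattern. The tanh units, the random weights, and the names `forward` and `criterion` are likewise hypothetical.

```python
import math
import random
from itertools import product

LAMBDA = 0.4  # value of the scalar constant reported in the text

# 4-bit parity with inputs in {-1, +1}; target +1 for an odd number of +1 bits.
PATTERNS = [(x, 1.0 if sum(v > 0 for v in x) % 2 else -1.0)
            for x in product((-1.0, 1.0), repeat=4)]


def forward(x, w_hidden, w_out, faulty=None):
    """Output of a 2-layer net; `faulty` indexes a hidden unit forced to 0 (assumed fault model)."""
    hidden = []
    for m, w in enumerate(w_hidden):
        net = w[-1] + sum(wi * xi for wi, xi in zip(w, x))  # w[-1] is the bias weight
        hidden.append(0.0 if m == faulty else math.tanh(net))
    net_out = w_out[-1] + sum(wi * hi for wi, hi in zip(w_out, hidden))
    return math.tanh(net_out)


def criterion(patterns, w_hidden, w_out, lam=LAMBDA):
    """Return (E, E_n, E_f) with E = E_n + lam * E_f."""
    e_n = sum((forward(x, w_hidden, w_out) - t) ** 2 for x, t in patterns)
    e_f = sum((forward(x, w_hidden, w_out, faulty=m) - t) ** 2
              for m in range(len(w_hidden)) for x, t in patterns)
    return e_n + lam * e_f, e_n, e_f


rng = random.Random(0)
W_H = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]  # 4 hidden units, 4 inputs + bias
W_O = [rng.uniform(-1, 1) for _ in range(5)]                      # 4 hidden weights + bias
total, e_n, e_f = criterion(PATTERNS, W_H, W_O)
print(total, e_n, e_f)
```

Setting `lam=0.0` recovers the plain classification error, which corresponds to the control condition used in the experiments.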

Genetic  algorithms  (GAs)  are  stochastic  optimization  algorithms  [4]  in  which  a  solution 
to  the  combinatorial  problem  of  interest  is  represented  by  a  binary  string,  referred  to  as  a 
chromosome.  A  fitness  value  is  defined  for  each  chromosome  based  on  the  cost  associated  with 
the  corresponding  solution.  A  population  of  chromosomes  is  maintained,  and  a  new  generation 
is formed by selecting mating pairs with superior fitness values. Genetic operators, such as
crossover or mutation, are applied to the mating pairs to form offspring, which are expected to be
improved, or better fit, solutions. This process is repeated until an acceptable solution appears
in  one  of  the  generations. 
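
The generic loop described above can be sketched as follows. This is a toy illustration, not the concurrent training algorithm of this paper: tournament selection, one-point crossover, bit-flip mutation, elitism, and the function name `evolve` are all choices made for the example.

```python
import random


def evolve(fitness, n_bits, pop_size=20, generations=50, p_mut=0.02, seed=0):
    """Evolve binary chromosomes; returns the best one and the best fitness per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        new_pop = [pop[0][:]]  # elitism: carry the current best over unchanged
        while len(new_pop) < pop_size:
            # Tournament selection: the fitter of two random chromosomes becomes a parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, n_bits)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]  # bit-flip mutation
            new_pop.append(child)
        pop = new_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy fitness: number of 1 bits in the chromosome.
best, history = evolve(sum, n_bits=16)
print(best, history[0], history[-1])
```

Because the best chromosome is always carried over, the best fitness in `history` never decreases from one generation to the next.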

Genetic algorithms have been used to train multilayered neural networks. In some approaches
(e.g., [5]), the network architecture is fixed and the network weights are encoded as a
chromosome  with  which  a  GA  is  used  to  search  for  the  optimal  weights.  In  others  (e.g.,  [6]),  a 
GA  is  used  to  assist  some  other  training  techniques  such  as  back-propagation  by  defining  the 
network  architecture,  by  finding  the  initial  weights  for  back-propagation,  or  parameters  used  in 
other  training  methods.  In  this  approach,  GA  is  used  to  augment  the  main  training  method  by 
finding  a  set  of  favorable  constraint  domains.  In  [7],  a  GA  is  used  to  train  a  neural  network  and 
to  construct  the  network  architecture  simultaneously.  In  [8],  a  GA  is  used  to  train  a  large  scale 
neural network system by training each component subnetwork or module separately, provided
that the training sets are available for all modules.

The  concurrent  training  algorithm  used  in  this  project  sets  the  synaptic  weights  using  a 
genetic algorithm search, in which multiple strands of chromosomes are used to encode a phenotype
[9]. A neural network that consists of N layers could be encoded as N chromosomes.
2  Experiments  and  Results


The  concurrent  training  of  fault-tolerant  neural  networks  is  validated  by  using  the  4-bit  parity 
problem,  so  that  there  are  a  total  of  16  input-target  pattern  pairs.  While  all  inputs  are  encoded 
using  binary  digits  1  and  -1,  the  weights,  and  hence  the  chromosomes,  are  encoded  with  binary 
digits  0  and  1. 

The experiments for training fault-tolerant neural networks were conducted by setting $\lambda = 0.4$
in the optimization function. As a control, the experiments were repeated with the same GA
parameters while setting $\lambda = 0.0$ in the optimization function, thus removing the fault-tolerance
inclusion property of the training algorithm. The networks trained in each experiment set
are 2-layer fully connected networks, each with four input units and one output unit. Different
numbers of hidden units are considered (viz. 4, 6, 8, and 10) to observe the effect of increasing
the number of hidden units on fault-tolerance. Because the training algorithm is stochastic in nature, seven
runs were made for each parameter setting, with the number of iterations set at a constant 200
generations for each run.

The results of the experiments are summarized in Tables 1 and 2, in which the number of
generations it took for the best network to evolve, the number of classification errors, and the
number of errors due to faults of the best network are tabulated against the number of hidden
units in the network. Tables 1 and 2 show the minimum and the average, respectively, of the
number of errors in the seven runs. In both tables, part (a) refers to the case where
fault-tolerance is included ($\lambda = 0.4$), while part (b) refers to the case where $\lambda = 0$, so that the
number of errors due to faults of the network is not used in the training.

Two  factors  are  of  interest  here:  the  learning  behavior  and  the  effectiveness  of  training 
with  fault-tolerance  inclusion.  The  learning  behavior  is  first  considered.  When  fault-tolerance 
is  not  used  in  the  training,  the  number  of  generations  required  to  obtain  an  optimal  network 
decreased  rather  rapidly  from  about  60  to  32  as  the  number  of  hidden  units  was  increased  from 
4  to  10,  as  can  be  expected.  When  fault-tolerance  is  included  in  training,  however,  the  number 
of  generations  required  to  obtain  an  optimal  network  increased  gradually  from  about  90  to 
110  as  the  number  of  hidden  units  was  increased.  More  time  is  needed  because  more  training 
constraints  were  imposed,  although  it  was  observed  that  some  runs  converged  early  and  some 
very  late. 

Next, the effectiveness of fault-tolerance inclusion training is studied. Consider a network
with 4 hidden units. Without fault-tolerance included in the training, it makes on average 5.43
(out of 16) output errors with a single fault in the network. Increasing the number of hidden
units did not produce more fault-tolerant networks, as shown in Tables 1(b) and 2(b). With
fault-tolerance included in the training, a network with 4 hidden units makes on average 2.86
output errors. The number of errors due to faults decreases to an average of 0.14 when 10 hidden
units are used.

The capacity of the neural networks trained with single fault-tolerance inclusion to handle
multiple faults is further tested. Networks with 10 hidden units are trained, again with $\lambda = 0.4$
for testing and with $\lambda = 0.0$ for control. One to five hidden units were set to fail in these
experiments. There are $\binom{10}{n}$ combinations of having $n$ faults, $n = 1, \ldots, 5$, out of 10 hidden units;
all combinations were tested and the maximum, average, and minimum errors of all combinations
were recorded. The experiments were repeated seven times, and the corresponding maximum,
average, and minimum errors were averaged over these runs.
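
The exhaustive multiple-fault test can be sketched with the standard library. The helper names `fault_sets` and `summarize` are hypothetical, and `errors_for` stands in for whatever routine counts the output errors of a trained network under a given set of failed hidden units.

```python
from itertools import combinations
from math import comb

N_HIDDEN = 10

# Number of distinct ways to choose n faulty units out of 10, for n = 1..5.
COUNTS = [comb(N_HIDDEN, n) for n in range(1, 6)]


def fault_sets(n_hidden, n_faults):
    """Enumerate every set of `n_faults` simultaneously failed hidden units."""
    return combinations(range(n_hidden), n_faults)


def summarize(errors_for, n_hidden, n_faults):
    """Maximum, average, and minimum error count over all fault combinations."""
    errs = [errors_for(faults) for faults in fault_sets(n_hidden, n_faults)]
    return max(errs), sum(errs) / len(errs), min(errs)


print(COUNTS)
```

A real `errors_for` would evaluate the 16 parity patterns with the listed hidden units disabled; averaging the three summary values over seven training runs would then give one row of a table like Table 3.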

The results are shown in Table 3, where each row contains averaged results from seven runs.
It can be seen that networks trained with fault-tolerance inclusion have fewer maximum errors than
networks trained without fault-tolerance inclusion; i.e., such fault-tolerant networks perform
better in the worst-case scenario in which several “important” hidden units fail. Their overall
average performance is also better than that of their counterparts. As can be expected, networks
trained with fault-tolerance inclusion do not handle multiple faults as effectively as single faults.
Multiple faults are handled in both cases by sheer redundancy. This could be attributed to the
nature of the error functions defined for the training.


3  Concluding  Remarks 


A  training  algorithm  is  presented  which  includes  a  fault-tolerance  component  as  part  of  the 
optimization  criterion.  Since  the  combined  error  function  is  not  convex,  a  genetic  algorithm  is 
used  to  search  for  the  optimal  weights.  The  representation  of  the  network  in  a  genetic  algorithm 
is  considered,  and  a  scheme  where  different  layers  of  the  networks  are  distributed  on  different 
chromosome strands is proposed and analyzed. Experimental results are used to show the learning
behavior as well as the effectiveness of the new training algorithm in producing networks that can
handle single as well as multiple faults.
Table 1. Best performance of concurrent training attained in seven runs.

Number of      Number of     Number of               Number of
hidden units   generations   classification errors   fault errors
4                            0.00                    3.00
6                            0.00                    1.00
8              106.00        0.00                    0.00
10             68.00         0.00                    0.00

(a) With fault-tolerance inclusion in learning ($\lambda = 0.4$).

Number of      Number of     Number of               Number of
hidden units   generations   classification errors   fault errors
4              98.00         0.00                    5.00
6              92.00         0.00                    3.00
8              22.00         0.00                    3.00
10             24.00         0.00                    3.00

(b) Without fault-tolerance inclusion in learning ($\lambda = 0.0$).
Table 2. Average performance of concurrent training attained in seven runs.


Number  of 
hidden  units 

Number  of 
generations 

Number  of 
fault  errors 

4 

92.29 

1.71 

6 

102.14 

0.14 

8 

103.43 

0.00 

0.57 

10 

107.29 

0.00 

0.14 

(a) With fault-tolerance inclusion in learning ($\lambda = 0.4$).


Number of      Number of     Number of               Number of
hidden units   generations   classification errors   fault errors
4              61.43         1.14                    5.43
6              44.71         0.00                    4.43
8              45.14         0.00                    4.57
10             32.86         0.00                    4.43

(b) Without fault-tolerance inclusion in learning ($\lambda = 0.0$).


Table 3. Average of the performance of concurrent training in seven runs with multiple faults.

Number of      Number of      Average of       Average of       Average of
Faulty Units   Combinations   Maximum Errors   Minimum Errors   Average Errors
1              10             0.14             0.00             0.11
2              45             4.71             0.00             1.38
3              120            5.57             0.14             2.09
4              210            6.43             0.14             2.82
5              252            7.14             0.14             2.99

(a) With fault-tolerance inclusion in learning ($\lambda = 0.4$).


Number  of 
Faulty  Units 

Number  of 
Combinations 

Average  of 
Maximum  Errors 

Average of
Minimum Errors

Average of
Average Errors

1 

4.43 

1.79 

2 

45 

5.57 

2.64 

3 

7.00 

3.33 

4 

7.57 

3.89 

5 

252 

8.14 

4.45 

(b) Without fault-tolerance inclusion in learning ($\lambda = 0.0$).
Abstract:

In this paper, the characteristics of the learning rule in the “Complex-BP”, a complex-numbered
version of the back-propagation algorithm, are investigated. The results of
this study may be summarized as follows: (a) the error back-propagation has a structure
which is concerned with two-dimensional motion, (b) the unit of learning is complex-valued
signals flowing in neural networks, (c) the learning rule is structured to avoid a
“standstill in learning”. Ultimately, learning speed is improved. In addition, the number
of parameters needed is only about half that of the standard BP.


1  Introduction 

The purpose of this paper is to investigate the characteristics of the learning rule in the
complex-valued version of the back-propagation algorithm, “Complex-BP” [2, 3]. We have
obtained the following results on the inherent properties of the Complex-BP algorithm:
(a) The error back-propagation has a structure which is concerned with two-dimensional
motion, (b) The unit of learning is complex-valued signals flowing in neural networks,
(c) The learning rule is structured to avoid a “standstill in learning”. Ultimately, the
average convergence speed is improved. In addition, the required number of weights and
thresholds (called “learning parameters” here) is only about half that of the standard
back-propagation algorithm or “Real-BP” [5]. Thus it seems that the Complex-BP algorithm
is well suited for learning complex-valued patterns.

2  The  “Complex-BP”  Algorithm

This section will briefly describe the Complex-BP algorithm [2, 3]. It can be applied to
multi-layered neural networks in which weights, threshold values, input and output signals
are all complex numbers, and the output function $f_C$ of a neuron is defined as

$$f_C(z) = f_R(x) + i f_R(y), \tag{1}$$

where $z = x + iy$, $i$ denotes $\sqrt{-1}$ and $f_R(u) = 1/(1 + \exp(-u))$; that is, the real and
imaginary parts of the output of a neuron are the sigmoid functions of the real part
$x$ and the imaginary part $y$ of the net input $z$ to a neuron, respectively. The learning rule
was obtained using a steepest descent method.

Note that there is another formulation of a complex-valued version [1] in which the
output function is a complex-valued function $f_C(z) = 1/(1 + \exp(-z))$, where $z = x + iy$.
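
The two output functions can be compared with a minimal Python sketch. The function names `f_C_split` and `f_C_full` are introduced here only for illustration.

```python
import cmath
import math


def f_R(u):
    """Real sigmoid."""
    return 1.0 / (1.0 + math.exp(-u))


def f_C_split(z):
    """Output function (1): the sigmoid applied separately to real and imaginary parts."""
    return complex(f_R(z.real), f_R(z.imag))


def f_C_full(z):
    """The alternative formulation: one sigmoid of the complex argument."""
    return 1.0 / (1.0 + cmath.exp(-z))


for z in (0j, 2 - 1j, -3 + 0.5j):
    print(z, f_C_split(z), f_C_full(z))
```

The split form keeps both output components strictly between 0 and 1, whereas $1/(1 + \exp(-z))$ is unbounded near $z = i(2k+1)\pi$, where the denominator vanishes.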


3  Characteristics  of  Learning 

In  this  section,  the  characteristics  of  learning  in  the  Complex-BP  algorithm  are  discussed. 
3.1  Structure  of  Learning  Rule

First of all, we investigated the structure of the learning rule in the Complex-BP algorithm,
using the three-layer (complex-valued) neural network described below as an example.
We used $w_{ml}$ for the weight between the input neuron $l$ and the hidden neuron $m$, $v_{nm}$
for the weight between the hidden neuron $m$ and the output neuron $n$, $\theta_m$ for the
threshold of the hidden neuron $m$, and $\gamma_n$ for the threshold of the output neuron $n$. We
let $I_l$, $H_m$, $O_n$ denote the output values of the input neuron $l$, the hidden neuron $m$,
and the output neuron $n$. We also let $\delta^n = T_n - O_n$ be the error between the output value
$O_n$ of the output neuron $n$ and the desired output value $T_n$ for the output neuron $n$.

Let $\Delta x^R$, $\Delta x^I$ be the real and imaginary parts of the magnitude of change of a learning
parameter $x$, respectively; i.e., $\Delta x^R = \mathrm{Re}[\Delta x]$, $\Delta x^I = \mathrm{Im}[\Delta x]$, where $\mathrm{Re}[z]$, $\mathrm{Im}[z]$
denote the real and imaginary parts of a complex number $z$, respectively. Then, the
learning rule can be expressed as:


(2) 

(3) 

(4) 


(5) 


where $A_n = (1-\mathrm{Re}[O_n])\mathrm{Re}[O_n]$, $B_n = (1-\mathrm{Im}[O_n])\mathrm{Im}[O_n]$, $C_m = (1-\mathrm{Re}[H_m])\mathrm{Re}[H_m]$,
$D_m = (1-\mathrm{Im}[H_m])\mathrm{Im}[H_m]$, $\beta_m = \arctan(\mathrm{Im}[H_m]/\mathrm{Re}[H_m])$, $\phi_l = \arctan(\mathrm{Im}[I_l]/\mathrm{Re}[I_l])$,
and $\psi_{nm} = \arctan(\mathrm{Im}[v_{nm}]/\mathrm{Re}[v_{nm}])$.

In expression (2), $|H_m|$ refers to a similar transformation (reduction, magnification) of
the distance between a point and the origin in the Euclidean plane, and
$\begin{pmatrix} \cos\beta_m & \sin\beta_m \\ -\sin\beta_m & \cos\beta_m \end{pmatrix}$
to a clockwise rotation of a point by the angle $\beta_m$ about the origin. Thus, the linear transformation
called two-dimensional motion is performed in equation (2). Hence, we find
that the magnitude of change in the weight between the hidden and output neurons
$(\Delta v^R_{nm}, \Delta v^I_{nm})$ can be obtained via the above linear transformation (two-dimensional motion)
of $(\Delta\gamma^R_n, \Delta\gamma^I_n)$, which is the magnitude of change in the threshold of the output neuron.
Similarly, the magnitude of change in the threshold of the hidden neuron $(\Delta\theta^R_m, \Delta\theta^I_m)$
can be obtained by applying the two-dimensional motion concerning $v_{nm}$ (the weight between
the hidden and output neurons) to $(\Delta\gamma^R_n, \Delta\gamma^I_n)$, which is the magnitude of change
in the threshold of the output neuron (equation (5)). Finally, $(\Delta w^R_{ml}, \Delta w^I_{ml})$ can be obtained
by applying the two-dimensional motion concerning $I_l$ to $(\Delta\theta^R_m, \Delta\theta^I_m)$ (equation
(4)). Thus, it seems to be quite reasonable to assume that the error propagation in the
Complex-BP has a structure based on two-dimensional motion.
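
The two-dimensional motion described above, a scaling by $|H_m|$ combined with a clockwise rotation by $\beta_m$, can be checked numerically. The sketch below is an illustration rather than the paper's learning rule: it only shows that this motion applied to a point $(x, y)$ coincides with multiplying $x + iy$ by the complex conjugate of $H_m$. The function name `motion` and the sample values are hypothetical, and `atan2` is used in place of $\arctan(\mathrm{Im}/\mathrm{Re})$; the two agree when the real part is positive, as it is for sigmoid outputs.

```python
import math


def motion(h, point):
    """Scale `point` by |h| and rotate it clockwise by beta = arg(h)."""
    beta = math.atan2(h.imag, h.real)
    x, y = point
    r = abs(h)
    return (r * (math.cos(beta) * x + math.sin(beta) * y),
            r * (-math.sin(beta) * x + math.cos(beta) * y))


h = complex(0.6, 0.8)      # stand-in for a hidden-unit output H_m
delta = (0.3, -0.2)        # stand-in for (real, imaginary) parts of a threshold change

moved = motion(h, delta)
product = h.conjugate() * complex(*delta)
print(moved, product)
```

Both computations give the same point, so the matrix form and the complex-product form are two views of the same operation.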

The two-dimensional structure of the error propagation described above makes its
appearance as the following mechanism: the unit of learning in the Complex-BP algorithm
is complex-valued signals flowing in neural networks. For example, $\Delta v^R_{nm}$ and $\Delta v^I_{nm}$
comprise both the real part ($\mathrm{Re}[H_m]$, $\mathrm{Re}[O_n]$) and the imaginary part ($\mathrm{Im}[H_m]$, $\mathrm{Im}[O_n]$)
of complex-valued signals ($H_m$, $O_n$) flowing in neural networks, respectively (equation
(2)). That is, there is a relation through ($\mathrm{Re}[H_m]$, $\mathrm{Re}[O_n]$) and ($\mathrm{Im}[H_m]$, $\mathrm{Im}[O_n]$) between
$\Delta v^R_{nm}$ and $\Delta v^I_{nm}$. Similarly, there are relations between $\Delta w^R_{ml}$ and $\Delta w^I_{ml}$ (equation (4)),
and between $\Delta\theta^R_m$ and $\Delta\theta^I_m$ (equation (5)). Therefore, in the Complex-BP algorithm, the
real and imaginary parts of learning parameters are modified, based on both the real and
imaginary parts of complex-valued signals flowing in neural networks, respectively. From
these facts, we may conclude that “complex-valued signals” flowing in neural networks
are a unit of learning in the Complex-BP algorithm.


3.2  Improving  Learning  Speed 

As we have seen in the previous subsection, the error propagation of the Complex-BP
algorithm has a structure based on two-dimensional motion, which also means that the
unit of learning is complex-valued signals flowing in neural networks. Furthermore, we
will find in this subsection that this structure improves learning speed.

The derivative $(1 - f_R(u)) f_R(u)$ of the sigmoid function $f_R(u)$, which is the output
function of each neuron, appears in the learning rule of the Real-BP. The value of the
derivative asymptotically approaches 0 as the absolute value of the net input $u$ to a
neuron increases. Hence, as $|u|$ increases and drives the output value of a neuron
toward 0.0 or 1.0, the derivative $(1 - f_R(u)) f_R(u)$ becomes small, which causes
what is called a standstill in learning. This phenomenon is called “getting stuck in a local
minimum” if it continues for a considerable length of time while the error
between the actual and desired output values remains large. As is generally known, this
is the mechanism of standstill in learning in the standard back-propagation algorithm.
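
The saturation effect is easy to see numerically. The short sketch below simply evaluates the sigmoid derivative for growing $|u|$; the names `f_R` and `d_f_R` are chosen for the example.

```python
import math


def f_R(u):
    """Real sigmoid."""
    return 1.0 / (1.0 + math.exp(-u))


def d_f_R(u):
    """Derivative of the sigmoid, written as (1 - f_R(u)) * f_R(u)."""
    return (1.0 - f_R(u)) * f_R(u)


for u in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"u = {u:6.1f}   f_R = {f_R(u):.6f}   derivative = {d_f_R(u):.6f}")
```

The derivative peaks at 0.25 for $u = 0$ and falls off quickly on both sides, so a weight update proportional to it shrinks accordingly once a unit saturates.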

On the other hand, two kinds of derivatives of the sigmoid function appear in the
learning rule of the Complex-BP algorithm (equations (2)-(5)): one is the derivative of
the real part of an output function ($(1-\mathrm{Re}[O_n])\mathrm{Re}[O_n]$, $(1-\mathrm{Re}[H_m])\mathrm{Re}[H_m]$); the other
is that of the imaginary part ($(1-\mathrm{Im}[O_n])\mathrm{Im}[O_n]$, $(1-\mathrm{Im}[H_m])\mathrm{Im}[H_m]$). The learning
rule of the Complex-BP algorithm basically consists of two linear combinations of them:

$$\alpha_1 (1-\mathrm{Re}[O_n])\mathrm{Re}[O_n] + \beta_1 (1-\mathrm{Im}[O_n])\mathrm{Im}[O_n], \tag{6}$$

$$\alpha_2 (1-\mathrm{Re}[H_m])\mathrm{Re}[H_m] + \beta_2 (1-\mathrm{Im}[H_m])\mathrm{Im}[H_m], \tag{7}$$

where $\alpha_k, \beta_k \in R$ ($k = 1, 2$) and $R$ denotes the set of real numbers. Note that expression (6)
shows a very small value when both $(1-\mathrm{Re}[O_n])\mathrm{Re}[O_n]$ and $(1-\mathrm{Im}[O_n])\mathrm{Im}[O_n]$ are very
small values. Hence, there is a possibility that expression (6) does not show an extremely
small value even if $(1-\mathrm{Re}[O_n])\mathrm{Re}[O_n]$ is very small, because $(1-\mathrm{Im}[O_n])\mathrm{Im}[O_n]$ is not
always small in the Complex-BP algorithm (whereas the magnitude of learning parameter
updates inevitably becomes quite small if $(1 - f_R(u)) f_R(u)$ is quite small in the Real-BP
algorithm). In this sense, the real factor ($(1-\mathrm{Re}[O_n])\mathrm{Re}[O_n]$, $(1-\mathrm{Re}[H_m])\mathrm{Re}[H_m]$)
makes up for the imaginary factor ($(1-\mathrm{Im}[O_n])\mathrm{Im}[O_n]$, $(1-\mathrm{Im}[H_m])\mathrm{Im}[H_m]$) showing
an abnormally small value, and vice versa. Thus, compared with the updating rule of
the Real-BP, that of the Complex-BP has a structure that reduces the probability for
standstill in learning to occur. This indicates that the learning speed of the Complex-BP
is faster than that of the Real-BP. This will be confirmed by computational experiments
on complex-valued patterns in the following subsection.
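
The compensation argument can be made concrete with a small numerical sketch of a linear combination of the two derivative factors for an output value. The coefficients are set to 1 purely for illustration (in the learning rule they depend on the signals and errors and may take either sign), and the function names are hypothetical.

```python
def real_factor(o):
    """(1 - Re[O]) * Re[O]: derivative factor of the real part."""
    return (1.0 - o.real) * o.real


def imag_factor(o):
    """(1 - Im[O]) * Im[O]: derivative factor of the imaginary part."""
    return (1.0 - o.imag) * o.imag


def combination(o, alpha=1.0, beta=1.0):
    """Linear combination of the two factors; alpha and beta are illustrative."""
    return alpha * real_factor(o) + beta * imag_factor(o)


# Output whose real part is saturated while its imaginary part is not.
o = complex(0.999, 0.5)
print(real_factor(o), imag_factor(o), combination(o))
```

With the real part saturated the real factor is below 0.001, yet the combination stays above 0.25 because the imaginary factor is at its maximum. A real-valued unit with output 0.999 would have only the small factor available.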


3.3  Learning  Speed 

In  this  subsection,  the  learning  speed  of  the  Complex-BP  algorithm  is  studied  for  some 
simulated  examples  on  complex- valued  patterns. 
In general, learning speed should be examined from the perspective of computational
complexity (time and space complexities). We assume here that “time complexity” means
the sum of the four arithmetic operations for real numbers, and “space complexity” the sum of learning
parameters (weights and thresholds).

The average number of learning cycles needed for convergence by the Complex-BP was compared
with that of the conventional back-propagation technique, using neural networks
for which the time complexities per learning cycle of the two techniques were almost equal. In
addition, the space complexity was also examined.

In the experiments, the initial real and imaginary components of the weights and
the thresholds were chosen to be random real numbers between -0.3 and +0.3. We
determined that learning finished when


0.10 


(8) 


held, where $T_k^{(p)}, O_k^{(p)} \in C$ denote the desired output value and the actual output value of
the neuron $k$ for the pattern $p$, respectively; i.e., the left side of equation (8) denotes the error between
the desired and actual output patterns. $C$ denotes the set of complex numbers, and $N$ denotes
the number of neurons in the output layer. We regarded the presentation of one set of
learning patterns to the neural network as one learning cycle.


Experiment  1 

First, a set of simple (complex-valued) learning patterns shown in Table 1 was used
to compare the performance of the Complex-BP algorithm with that of the standard
back-propagation algorithm. We used a 1-3-1 three-layered network for the Complex-BP,
and a 2-7-2 three-layered network for the standard BP. Table 2 shows that their time
complexities per learning cycle were almost equal.

In  the  experiment  for  the  Real-BP,  the  real  component  of  a  complex  number  was  input 
into  the  first  input  neuron,  and  the  imaginary  component  was  input  into  the  second  input 
neuron.  The  output  from  the  first  output  neuron  was  interpreted  to  be  the  real  component 
of a complex number; the output from the second output neuron was interpreted to be
the  imaginary  component. 

The  average  number  of  iterations  required  for  convergence  of  50  trials  for  each  of  6 
learning  rates  (0.1,  0.2,  •  •  • ,  0.6)  was  adopted  as  a  criterion  of  the  evaluation.  Although 
we  stopped  learning  at  the  50,000th  iteration,  all  trials  succeeded  in  converging.  The 
result  of  the  experiments  is  shown  in  Fig.  1 . 


Experiment  2 

Next, we carried out an experiment using the set of (complex-valued) learning patterns
shown in Table 3. The learning patterns were defined according to the following two
rules: (a) the real part of “Complex number 3” (output) is 1 if “Complex number 1”
(input)  is  equal  to  “Complex  number  2”  (input),  otherwise  it  is  0;  (b)  the  imaginary  part 
of  “Complex  number  3”  is  1  if  “Complex  number  2”  is  equal  to  either  1  or  i,  otherwise  it 
is  0. 

The  experimental  task  was  the  same  as  in  Experiment  1  except  for  the  layered  network 
structure:  a  2-4-1  three-layered  network  was  used  for  the  Complex-BP  method  while  a 
4-9-2  three-layered  network  was  used  for  the  standard  method.  Table  4  shows  that  their 
time  complexities  per  learning  cycle  were  equal. 

In  the  experiment  for  the  Real-BP,  the  real  and  imaginary  components  of  “Complex 
number  1”  and  the  real  and  imaginary  components  of  “Complex  number  2”  were  input 
into  the  first,  second,  third  and  fourth  input  neurons,  respectively.  The  output  from  the 
first output neuron was interpreted to be the real component of a complex number; the
output  from  the  second  output  neuron  was  interpreted  to  be  the  imaginary  component. 

We  stopped  learning  at  the  100,000th  iteration.  The  results  of  the  experiments  are 
shown  in  Fig.  2.  For  reference,  we  show  the  rate  of  convergence  in  Table  5. 

We can conclude from these experiments that the Complex-BP exhibits the following
characteristics in learning complex-valued patterns: the learning speed is several times
faster than that of the conventional technique (Figs. 1 and 2), while the space complexity
(i.e. the number of learning parameters) is only about half that of the standard BP
(Tables 2 and 4).

We can assume that these characteristics are caused by the structure, described in the
previous subsection, that reduces “standstill in learning” through the linear
combinations (equations (6) and (7)) of the real and imaginary components of the
derivative of an output function.

4  Conclusions 

We investigated the fundamental characteristics of the Complex-BP algorithm and found
that the Complex-BP had some inherent properties. In particular, the average
convergence speed was superior to that of the Real-BP. In addition, the number of learning
parameters needed was only about half that of the standard BP. It is interesting that
such characteristics appeared only by extending neural networks to complex numbers.

Table 2: Computational complexity of the Complex-BP and the Real-BP [Experiment 1].


Abstract

In this paper we explore the Elman recurrent network by constructing and identifying finite
state automata (FSA) for the addition task. By constructing a Mealy machine for addition,
non-deterministic elements in the training data were identified. The training performance of different
training strategies was investigated with non-deterministic data versus deterministic data. To
identify an FSA for addition, we analyze the internal representations of the network by using Hierarchical
Cluster Analysis, and we suggest Sammon Transformation Analysis as a superior clustering
technique to the more familiar Principal Component Analysis. These techniques together
with the Mealy machine clearly identify the states of the finite state machine for addition.


1  Introduction 

Elman  [Elman,  1990]  introduced  a  simple  recurrent  architecture  that  has  the  potential  to  master  an 
infinite  set  of  sequences  by  copying  the  pattern  of  activation  of  the  hidden  units  onto  a  set  of  context 
units  which  feed  into  the  hidden  layer  along  with  the  inputs.  In  this  paper  we  show  that  the  Elman 
simple  recurrent  network  (SRN)  can  learn  to  mimic  closely  a  finite  state  machine  (FSM),  both  in  its 
behaviour  and  in  its  state  representation.  We  start  by  constructing  a  Mealy  machine  for  the  addition 
task  to  aid  in  identifying  the  finite  state  machine  of  the  network  dynamics.  The  Mealy  machine  also 
enabled  us  to  identify  non-deterministic  elements  in  the  training  data.  As  a  spin-off  experiment  we 
investigated  the  training  performance  of  different  training  strategies  for  training  with  non-deterministic 
data  versus  training  with  deterministic  data. 

To analyze the internal representations of the Elman network, we have used not only familiar
techniques such as Hierarchical Cluster Analysis, which describes the static representation of the network
dynamics, and Principal Component Analysis, which gives a more dynamic representation of the network,
but also a fairly unfamiliar technique called Sammon Transformation Analysis, which likewise depicts
network dynamics in a more dynamical fashion. We show further how the results of these techniques
coincide and are congruent with the Mealy machine drafted in section 3, and also clearly identify the
states of a Moore machine for addition.


2  Addition  Experiment 

The aim of the addition experiment [Cottrell & Tsung, 1991] is to learn to sequentially add two base
four numbers. Each base four number is given a two-digit binary representation. The Elman SRN is
confined to one column of digits at a time. It has five inputs, one indicating the end of the input and four
representing the one column of digits, 16 hidden and 16 context units, and six output units representing
the sum (two units) of the one column of digits and the four possible actions (four units). Actions are to
write the sum, to remember or output the carry, shift to the next column of digits, and indicate if done.
The network is trained to produce a sequence of outputs, having an action and result field, according to
the following program:

    while not done do begin
        output(WRITE, low_order_result);
        if sum >= radix then output(CARRY, '00');
        output(NEXT, '00');
    end

    if carry_on_previous_input then output(WRITE, '01');
    output(DONE, '00');


Since  the  Elman  SRN  is  required  to  learn  a  sequence  of  inputs  and  for  each  input  a  different  sequence 
of  outputs  as  well  as  looping,  the  addition  task  is  quite  complex. 
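The control program can be sketched in Python to generate the target output sequence for one addition problem. This is only an illustrative sketch, not the authors' code: the function name `addition_targets`, the least-significant-first digit lists, and the assumption that a pending carry is added into the next column's sum are all choices made here for illustration.

```python
RADIX = 4  # base-four digits, as in the addition experiment


def addition_targets(top, bottom):
    """Return the (action, result) sequence for adding two base-4 numbers.

    `top` and `bottom` are equal-length lists of digits, least significant
    column first. Result fields are two-bit strings.
    """
    outputs = []
    carry = 0
    for a, b in zip(top, bottom):
        # Assumption: the carry from the previous column joins the column sum.
        total = a + b + carry
        outputs.append(("WRITE", format(total % RADIX, "02b")))
        carry = 1 if total >= RADIX else 0
        if carry:
            outputs.append(("CARRY", "00"))
        outputs.append(("NEXT", "00"))
    if carry:
        # A carry left over from the last column is written as '01'.
        outputs.append(("WRITE", "01"))
    outputs.append(("DONE", "00"))
    return outputs


if __name__ == "__main__":
    # 3 + 2 in base four (one column each).
    for step in addition_targets([3], [2]):
        print(step)
```

Even a single column with a carry yields five output steps, which illustrates why each input can require a different number of outputs.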

3  Mealy  Machine  for  Addition 


Figure  1:  Mealy  Machine  for  Addition 

A Mealy machine was constructed to characterize the addition task more precisely and to help identify
the finite state machine of the network dynamics. A Mealy machine is a 5-tuple (S, A, T, O, f), where
S is a finite set of states, A is a finite set called an input alphabet, T : S × A → S is the transition
function, O is the output alphabet, and f : S × A → O is the output function. For the simple recurrent
network of the addition problem the input patterns are the input alphabet of the Mealy machine, whilst
the target output patterns are the output alphabet.
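The 5-tuple definition maps naturally onto a small data structure. The sketch below is a generic illustration, not the machine of Figure 1: the class name `MealyMachine` and the toy serial binary adder used as an example are assumptions introduced here. The transition function T and the output function f are stored together in one dictionary keyed by (state, input symbol).

```python
class MealyMachine:
    """Minimal Mealy machine: outputs depend on the state AND the input."""

    def __init__(self, transitions, start):
        # transitions: {(state, input_symbol): (next_state, output_symbol)}
        self.transitions = transitions
        self.start = start

    def run(self, inputs):
        state, outputs = self.start, []
        for symbol in inputs:
            state, out = self.transitions[(state, symbol)]
            outputs.append(out)
        return outputs


# Toy example (hypothetical): a serial binary adder. The state is the
# pending carry, each input is a pair of bits, and the output is the sum bit.
adder_transitions = {}
for carry in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            total = a + b + carry
            adder_transitions[(carry, (a, b))] = (total // 2, total % 2)

adder = MealyMachine(adder_transitions, start=0)

if __name__ == "__main__":
    # 3 + 1, least significant bit first, padded with a final (0, 0) column.
    print(adder.run([(1, 1), (1, 0), (0, 0)]))
```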

The  Mealy  machine  in  Figure  1  describes  all  the  input-output  combinations  in  the  addition  problem. 
Each  transition  represents  a  specific  group  of  input-output  transitions,  which  is  specified  in  Tables  1 
and  2.  The  top  half  of  the  Mealy  machine  describes  the  input-output  combinations  involved  in  zero  or 
one carry, whereas the bottom half depicts those input-output transitions involved in more than one
carry  (top  and  bottom  halves  indicated  in  the  figure). 

The result input-output combinations are denoted by Rx, where x is the type of result action indicated
by N, C, D, CN, CC, and CD. Rn denotes the result actions that lead to next actions, whereas Rc actions
lead to carry actions. Rcn and Rcc are result actions which incorporate the changes in the result
field due to carry actions earlier in the current temporal pattern. They represent result actions that
lead to next and carry actions, respectively. Rd and Rcd are the final result actions that lead to done
actions, where the former is part of a temporal pattern that includes only one carry, while the latter’s
temporal pattern includes more than one carry.

The  carry  input-output  combinations  are  denoted  by  Cx,  where  x  is  the  type  of  carry  action  indicated 
by  C  and  CC.  Cc  is  the  first  carry  actions  in  a  temporal  pattern,  while  Ccc  indicates  the  successive 
carry  actions. 

The next input-output combinations are denoted by Nx, where x is the type of next action indicated
by N, C, CN, and CC. Nn denotes next actions contained in a temporal pattern with no carry actions earlier
in the temporal pattern, whereas Nc actions indicate one carry action earlier in the temporal pattern.
Ncn and Ncc are next actions which indicate more than one carry action earlier in the temporal pattern.
They differ in that the latter’s preceding action is a carry (Ccc), whilst in the former’s case it is a result
action (Rcn).

The done input-output combinations are denoted by Dx, where x is the type of done action indicated
by N, C, CN, and CC. Dn denotes done actions that are preceded by a next action (Nn), whereas Dc actions
are preceded by a result action (Rd) which is due to a carry action. Dcc denotes done actions which are
performed after more than one carry and preceded by a result action (Rcd). Dcn denotes done actions which
are performed after at least one carry and preceded by a next action (Ncn).

Table 1 specifies the Mealy machine transitions for temporal patterns that include zero or one carry
action, whereas the temporal patterns of Table 2 include more than one carry. In both tables only the
one column of digits (the top and bottom digits) is shown as input. The end-of-input bit of the input is
not shown, because it is zero for all actions except for the done actions (Dn, Dc, Dcn and Dcc), when
its value is one.


Table 1: Mealy machine transitions for zero or one carry. The * indicates a non-deterministic transition.


Table 2: Mealy machine transitions for more than one carry. The * indicates a non-deterministic
transition.


4  Training:  Non-determinism  versus  Determinism 

All the Mealy machine transitions in Tables 1 and 2 are deterministic, except those marked with a star.
In Table 1 there is a non-deterministic choice between the result actions Rc and Rd when the input is
0111 and 1011, i.e. similar output patterns corresponding to different result actions exist for a specific
input. In Table 2 the non-deterministic choice is between the result actions Rcc and Rcd when the
input is 0011, 1101, and 1110. One way to make these choices deterministic is to change the
end-of-input bit into a one for Rd and Rcd in order to distinguish them uniquely from Rc and
Rcc, respectively. Thus every output pattern corresponding to an action is uniquely mapped onto a specific input
pattern. This is also logically plausible, since Rd and Rcd are the only result actions leading to done
actions. The next interesting step was to determine the difference in training performance when training
with non-deterministic data (not a unique input-output mapping) versus deterministic data (a unique
input-output mapping). The training performance of different training strategies was investigated for
these two cases.
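The idea of locating and removing non-deterministic choices can be sketched as a simple check over a list of transitions. This is a hedged illustration only: the function name `nondeterministic_choices`, the source state label "N", and the encoding of the end-of-input bit as a fifth character appended to the four column bits are hypothetical and are not taken from Tables 1 and 2.

```python
from collections import defaultdict


def nondeterministic_choices(transitions):
    """Return the (state, input) pairs that have more than one successor.

    `transitions` is an iterable of (state, input_pattern, next_state).
    """
    successors = defaultdict(set)
    for state, pattern, next_state in transitions:
        successors[(state, pattern)].add(next_state)
    return {key: nxt for key, nxt in successors.items() if len(nxt) > 1}


# Hypothetical fragment: with the end-of-input bit left at 0, the input
# 0111 can lead to either Rc or Rd from the same state.
before = [("N", "0111" + "0", "Rc"), ("N", "0111" + "0", "Rd")]

# Setting the end-of-input bit to 1 for Rd separates the two cases.
after = [("N", "0111" + "0", "Rc"), ("N", "0111" + "1", "Rd")]

if __name__ == "__main__":
    print(nondeterministic_choices(before))
    print(nondeterministic_choices(after))
```

An empty result after the change indicates that every (state, input) pair now has a single successor.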

The first training strategy, Combined Subset Training (CST) [Cottrell & Tsung, 1991], consists of
dividing the training set into random subsets, where training occurs on combined larger subsets. The next
training strategy, Increased Complexity Training (ICT) [Ludik & Cloete, 1993], differs from the first by
dividing the training set not into random subsets, but into subsets of increasing complexity, each one
having a termination criterion. We have also proposed two incremental training strategies called Incremental
Subset Training (IST) and Incremental Increased Complexity Training (IICT) [Cloete & Ludik, 1994].
These strategies incrementally increase subset size and consist of two nested loops: (a) an inner loop
which decrements the RMS termination values in a linear fashion for the incremental subsets until the
desired RMS criterion is reached; (b) an outer loop which repeats until successful generalization on an
independent test set. For IST, training occurs on incremental subsets of random complexity, whereas
IICT’s incremental subsets increase in complexity. These four training strategies were compared to
Fixed Set Training (FST), where a network is trained with a fixed set of training patterns. The addition
simulation results for the non-deterministic and deterministic cases are summarized in Table 3.


Table 3: A comparison of addition simulation results for the non-deterministic and deterministic cases

In the non-deterministic case, all four training strategies improved the number of updates by more
than 40% compared to FST, IST being the pick of the strategies by achieving 53.3%. In the deterministic
case, IST again performed very well by improving performance by 47% compared to FST. There is a
substantial difference in training performance when training with non-deterministic data versus training
with deterministic data. This is confirmed by the results in the last column of Table 3, where all the
training strategies performed much better with the deterministic data. Noteworthy results are those
of IST and FST, which obtained improvements of 51.3% and 57.1%, respectively. We suspected that
training would be easier with the deterministic data, but were quite surprised at the vast improvements,
especially when one considers that only one bit in 149 input patterns was changed out of a possible
2305 input patterns with a length of 11 bits (that is, only about a 0.6% change in the total fixed training
set). These results emphasize the importance of identifying the finite state machine of the training data
in order to eliminate the non-deterministic elements, if possible.

5  Analysis  of  Internal  Representations 

In  this  section  we  analyze  the  hidden  unit  activations  by  using  Sammon  Transformation  Analysis,  Prin¬ 
cipal  Component  Analysis,  and  Hierarchical  Cluster  Analysis.  We  show  further  how  the  results  of  these 
techniques  identify  the  states  of  a  Moore  machine  for  addition. 

For analyzing purposes we have used the weight matrices of the best training strategy, IST, in the
classification  process  of  8-10  column  addition.  We  have  extracted  the  16  hidden  unit  activations  over 
time,  as  the  Elman  network  processed  the  classification  data,  which  consisted  of  ten  temporal  patterns 
constituting  233  single  input  patterns. 

5.1 Sammon Transformation Analysis

Sammon  Transformation  Analysis  (STA)  [Sammon,  1969]  is  a  data  transformation  technique  that  maps 
multidimensional  vectors  onto  two  or  three  dimensional  vectors,  whose  intervector  distances  tend  to 
approximate  those  of  the  multidimensional  vectors. 
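How well a low-dimensional configuration preserves the intervector distances is usually measured by Sammon's stress, which the mapping then minimizes iteratively. The sketch below only evaluates the stress for a given pair of configurations; it does not perform the minimization. The function names are illustrative, and the code assumes all high-dimensional points are distinct.

```python
import math


def pairwise_distances(points):
    """Euclidean distance for every pair i < j of the given points."""
    n = len(points)
    return {(i, j): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}


def sammon_stress(high, low):
    """Sammon's stress between original (`high`) and projected (`low`) points.

    Each squared distance error is divided by the original distance, so
    small original distances are weighted more heavily.
    """
    d_high = pairwise_distances(high)
    d_low = pairwise_distances(low)
    scale = sum(d_high.values())
    return sum((d_high[k] - d_low[k]) ** 2 / d_high[k] for k in d_high) / scale


if __name__ == "__main__":
    high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    exact = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # distances preserved
    collapsed = [(0.0,), (1.0,), (0.0,)]           # two points coincide
    print(sammon_stress(high, exact), sammon_stress(high, collapsed))
```

A stress of zero means that all intervector distances are reproduced exactly; larger values indicate more distortion.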

In Figure 2(a) we show the projection of the hidden unit vectors onto two dimensions as the network
is doing the 233-step addition. The clusters formed by the projected hidden unit activations correspond
vividly to the different types of actions that the network is required to learn. Six clusters can be
identified that correspond to the main transitions of the four different actions, namely Next-Result (NR),
Result-Next (RN), Carry-Next (CN), Result-Carry (RC), Result-Done (RD), and Next-Done (ND). Along the
x-axis the network is distinguishing between a Next that follows a Carry (CN) versus one that follows a
Result action (RN). Along the y-axis the network is distinguishing between a Done that follows a Result
(RD) versus one that follows a Next action (ND).

Figure 2(b) illustrates the correspondence between the STA data and the Mealy machine transitions
in the previous section. The following mapping exists between the transition clusters in Figure 2(a)
and the Mealy machine transitions: NR = {Rn, Rc, Rd, Rcn, Rcc, Rcd}; RN = {Nn, Ncn}; CN =
{Nc, Ncc}; RC = {Cc, Ccc}; RD = {Dc, Dcc}; and ND = {Dn, Dcn}. Another interesting
result is the clear-cut separation between clusters that represent actions involved in a carry and clusters
representing actions not involved. Figure 2(b) also shows the existence of two groups of actions in the
NR cluster, namely a no-carry group {Rn, Rcn} and a carry group {Rc, Rd, Rcc, Rcd}. The fact
that this was not evident in Figure 2(a) shows the importance of using finite state automata, such as a
Mealy machine, in the analysis process.


Figure 2: Sammon Transformation Analysis of 233-step 8-10 column addition - (a) transition graph (b)
Mealy machine correspondence

5.2  Principal  Component  Analysis 

Principal Component Analysis (PCA) is a technique whereby multidimensional vectors are mapped onto
a new set of orthogonal linear vectors, where the first principal component is such that the projections
of the given points onto it have maximum variance among all possible linear coordinates; the second
principal component has maximum variance subject to being orthogonal to the first; and so on. In
Figure 3(a) we show the projection of the hidden unit vectors onto the plane of the first two principal
components as the network is doing the 233-step addition. The figure illustrates the correspondence
between the PCA data and the Mealy machine transitions, which is similar to the STA correspondence.
The Result actions are generally in the left half of the space, whereas the Nexts and Carrys are in the right
half. Along the second principal component the network is distinguishing between a Next that follows a
Carry (CN) versus one that follows a Result action. Clusters that represent actions involved in a Carry
can be linearly separated from clusters representing actions not involved.
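The definition of the first principal component as the direction of maximum variance can be illustrated with a short, dependency-free sketch. It uses power iteration on the sample covariance matrix, which is one common way to obtain the leading component and is not necessarily the procedure used for Figure 3(a); the function name and the starting vector are assumptions made here.

```python
def first_principal_component(data, iterations=200):
    """Return (direction, scores) of the first principal component.

    `data` is a list of equal-length numeric rows with non-zero variance.
    """
    n, dim = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(dim)]
    centered = [[row[k] - mean[k] for k in range(dim)] for row in data]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    # Arbitrary, asymmetric starting vector; power iteration converges to
    # the leading eigenvector unless the start is orthogonal to it.
    v = [1.0 / (k + 1) for k in range(dim)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Projections of the centered points onto the component.
    scores = [sum(r[k] * v[k] for k in range(dim)) for r in centered]
    return v, scores


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    direction, scores = first_principal_component(points)
    print(direction, scores)
```

For points lying on a straight line, the returned direction is the unit vector along that line, and the scores give each point's position along it.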

Graphs similar to Figure 2(a) were also generated; again six clusters were identified that correspond
to the main transitions of the four actions. We have also obtained similar results by plotting the first
principal component at time t versus t+1, which basically gives a mapping from the context vector to
the next hidden vector.

By comparing Figures 2(a) and 3(a), it is quite evident that STA produces superior clustering results
as opposed to PCA for this experiment. We conjecture that this will be the case for other experiments
as well, since STA preserves in a certain sense the intervector distances, whereas PCA discards them.

5.3  Hierarchical  Cluster  Analysis 

Hierarchical Cluster Analysis (HCA) is a method of finding the optimal partition of training vectors
according to some similarity measure, such as Euclidean distance. The matrix of Euclidean distances
between each pair of hidden activation vectors of the 233-step addition served as input to a cluster
analysis program. In the graphical results of this analysis, each leaf in the tree corresponds to a particular
transition from one action to another. In these graphs, the activation patterns are grouped according
to the six main transitions between the different actions, as was the case with STA and PCA. We
have also plotted graphs illustrating the correspondence between the HCA data and the Mealy machine
transitions, which is similar to the STA and PCA correspondences.
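A minimal agglomerative clustering procedure driven by Euclidean distances can be sketched as follows. The linkage rule of the cluster analysis program is not stated in the text, so average linkage is assumed here purely for illustration, and the function name `agglomerate` is hypothetical.

```python
import math


def agglomerate(points, n_clusters):
    """Merge the closest pair of clusters until `n_clusters` remain.

    Returns a list of clusters, each a list of indices into `points`.
    Average linkage (mean pairwise Euclidean distance) is an assumption.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(agglomerate(toy, 2))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree whose leaves correspond to individual activation vectors.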

5.4  Identification  of  Moore  Machine 

The STA, PCA, and HCA clustering analysis techniques show clearly how the hidden activations classify
the main transitions of the four different actions. The clusters obtained with these techniques correspond
to the states of a Moore machine, which is an appropriate FSM, since it is easier to construct the
complete machine from these clusters. The remaining task was to determine the Moore machine’s input
and output alphabet, as well as its transition and output functions. Figure 3(b) presents the graphical
representation of the Moore machine, which describes the hidden layer dynamics for addition. The six
transitions between actions are the states S = {NR, RC, CN, RN, ND, RD}, where NR is the start
state and RD and ND the final states. The input symbols of the input alphabet A = {0c, 0n, 1c, 1n} are
represented in such a manner that 0 or 1 indicates not-end-of-input and end-of-input, respectively, and
c and n carry and no-carry, respectively. The output alphabet is defined by O = {R, C, N, D}, where
the output symbols are Result, Carry, Next, and Done, respectively. Each state of the Moore machine
corresponds to Mealy machine transitions, as described in section 5.1.


Figure 3: (a) Principal Component Analysis of 233-step column addition (b) Moore machine for addition

6  Conclusions 

We have investigated the Elman recurrent network by constructing a Mealy machine for addition and
identifying a Moore machine that corresponds with the internal representations of the network. The
construction of the Mealy machine also enabled us to identify non-deterministic elements in the training
data. We have demonstrated with five training strategies that training is much easier (in two cases more
than 50%) with deterministic data as opposed to non-deterministic data, even though the difference
in the two training sets was only 0.6%. IST, the best training strategy, improved performance in the
non-deterministic case by 53% compared to fixed set training and in the deterministic case by 47%.

We have analyzed the internal representations of the network by using Hierarchical Cluster Analysis
and suggesting Sammon Transformation Analysis as a superior clustering technique when compared with
Principal Component Analysis. We have also shown how the clusters formed by these techniques clearly
identify the states of a Moore machine for addition.

Abstract

We  show  that  the  standard  criterion  function  of  a  feedforward  neural  network  is  Lipschitzian. 
Procedures  are  developed  to  compute  efficiently  local  Lipschitz  constants  over  subsets  of  the 
weight  space.  Local  Lipschitz  constants  can  be  used  to  compute  lower  bounds  on  the  optimal 
solution.  They  can  also  be  used  to  identify  weight  subregions  that  do  not  contain  promising 
solutions,  hence  reduce  the  search  space. 


1  Introduction 

The backpropagation (BP) algorithm and its many variations are the most popular training
algorithms for feedforward neural networks (FNNs). However, those gradient based training algorithms
have  some  limitations.  One  of  them  is  obtaining  only  local  minimum  solutions.  A  local  minimum 
may  or  may  not  represent  an  acceptable  solution. 

Empirical  results  have  shown  that  with  ample  hidden  units  embedded  in  the  network,  BP 
can  usually  escape  a  local  minimum  (Rumelhart  et  al.,  1986)  probably  due  to  large  degrees  of 
freedom.  However,  increasing  hidden  units  in  the  network  may  not  be  an  appealing  idea,  since 
an  unnecessarily  large  number  of  hidden  units  is  likely  to  decrease  the  generalization  capability  of 
the  network  (Kruschke  and  Movellan,  1991;  Baum  and  Haussler,  1989)  and  may  cause  overfitting 
problems  (Weigend  et  al.,  1990).  In  this  paper,  we  show  that  an  FNN  is  Lipschitzian.  Thus  various 
Lipschitz  optimization  methods  (e.g.,  Piyavskii,  1972;  Horst  and  Tuy,  1990)  can  be  applied  to  neural 
network  training.  This  approach  would  overcome  the  problem  of  converging  to  a  local  minimum 
and yield a globally optimal solution (Tang and Koehler, 1993a).

2  An  FNN  is  Lipschitzian 

Lipschitz  optimization  deals  with  the  global  optimization  of  a  wide  class  of  functions — the  Lipschitz 
functions.  In  the  following,  we  first  give  the  definition  of  Lipschitz  functions.  Then  we  show  that 
the  standard  sum-of-square  error  (SSE)  of  an  FNN  is  Lipschitzian. 

Definition 2.1 (Lipschitz function) A continuous function $F : M \to R$, $M \subset R^S$, is a Lipschitz
function if there exists a constant $L = L(F, M) > 0$ such that

$$|F(x) - F(y)| \le L \|x - y\|, \quad \forall x, y \in M,$$

where S is a positive integer and L is called a Lipschitz constant.

Knowing the Lipschitz constant of a function F provides a way of computing lower bounds on the
global minimum of F. Suppose we want to minimize F over M, and let $\delta(M) = \max \{\|x - y\| : x, y \in M\}$
be the diameter of M. From the definition of a Lipschitz function, we have

$$F(y) \ge F(x) - L\|x - y\| \ge F(x) - L\delta(M), \quad \forall x, y \in M.$$




If $F(x)$ is known for some $x \in M$, then $F(x) - L\delta(M)$ gives a lower bound to the global minimum
of F over M. The following lemmas are needed before we develop the procedures that give easily
computable lower bounds on the Lipschitz constant for an FNN.
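As a small numerical illustration of the bound $F(x) - L\delta(M)$, the sketch below evaluates it when M is a hyper-rectangle, whose Euclidean diameter is the distance between its lower and upper vertices. The function names and the numbers in the example are hypothetical.

```python
import math


def diameter(lower, upper):
    """Euclidean diameter of the hyper-rectangle with the given vertices."""
    return math.sqrt(sum((u - l) ** 2 for l, u in zip(lower, upper)))


def lower_bound(f_value, lipschitz, lower, upper):
    """Lower bound on min F over the box, given F at any one point in it."""
    return f_value - lipschitz * diameter(lower, upper)


if __name__ == "__main__":
    # Hypothetical values: F(x) = 2.0 at some x in the box, L = 0.1.
    bound = lower_bound(2.0, 0.1, lower=[0.0, 0.0], upper=[3.0, 4.0])
    print(bound)
```

If such a bound exceeds the best criterion value already found elsewhere, the box cannot contain a better solution and can be dropped from the search.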

Lemma 2.1 Let $f_i : R^n \to R$, $i = 1, 2, \ldots, m$, be Lipschitzian with Lipschitz constants $L_i$,
respectively. Then $F : R^n \to R$, given by $F = \sum_{i=1}^{m} f_i$, is Lipschitzian, and a Lipschitz constant of F is
given by $L_F = \sum_{i=1}^{m} L_i$.

Lemma 2.2 Let $x \in R^n$, and $F(x) = f(g(x))$, where $f : R \to R$, $g : R^n \to R$ are Lipschitzian
with Lipschitz constants $L_f$ and $L_g$, respectively. Then $F(x)$ has a Lipschitz constant $L_F$ given by
$L_F = L_f L_g$.

Lemma 2.3 Let $x \in R^n$. The $l_p$ norm on $R^n$, for $1 \le p \le \infty$, satisfies

$$\|x\|_p \le \sum_{i=1}^{n} |x_i|.$$

Lemma 2.4 Let $x \in R^n$, and $F(x) = f(g(x))$, where $f : R^m \to R$ is Lipschitzian with Lipschitz
constant $L_f$, and $g : R^n \to R^m$ with components $g_i$, $i = 1, 2, \ldots, m$, being Lipschitzian with Lipschitz
constants $L_{g_i}$. Then $F(x)$ has a Lipschitz constant $L_F$ given by $L_F = L_f \sum_{i=1}^{m} L_{g_i}$.
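The composition lemmas follow directly from the definition of a Lipschitz function. A brief derivation is sketched here for completeness; it assumes the same norm is used throughout and relies on the standard inequality $\|v\|_p \le \sum_i |v_i|$.

```latex
% Lemma 2.2 (scalar inner function g):
|F(x) - F(y)| = |f(g(x)) - f(g(y))|
             \le L_f \, |g(x) - g(y)|
             \le L_f L_g \, \|x - y\| .

% Lemma 2.4 (vector-valued g with components g_i):
|F(x) - F(y)| \le L_f \, \|g(x) - g(y)\|_p
             \le L_f \sum_{i=1}^{m} |g_i(x) - g_i(y)|
             \le L_f \Big( \sum_{i=1}^{m} L_{g_i} \Big) \|x - y\| .
```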

In the following discussion, we assume the standard sigmoid activation function f (with range
(0, 1.0)) is used. The transfer function is a linear function of the inputs from the previous layer with
a  constant  term  (the  bias).  L  is  used  to  denote  a  Lipschitz  constant  with  subscripts  identifying  the 
corresponding  functions. 

For a single-output FNN with one hidden layer, the output of the network is

$$o = f(w, x) = f\Big(\sum_{j=1}^{h} w_j f_j\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big) + w_0\Big)$$

where h is the number of hidden units, and the $f_j$ are activation functions in the hidden layer. Note
that the output o can be written as a composite function $o = f(g(w, x))$, where

$$g = \sum_{j=1}^{h} w_j f_j\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big) + w_0$$

Applying Lemma 2.2, we have

$$L_o = L_f L_g.$$

$L_f$ is given by

$$L_f = \max \|\nabla_g f(w, x)\| = \max \gamma f(1 - f), \quad \forall w \in W \qquad (1)$$
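To illustrate how the maximum of $\gamma f(1 - f)$ can be found cheaply, the sketch below assumes the sigmoid has the form $f(z) = 1/(1 + e^{-\gamma z})$ and that the net input z is known to lie in an interval; both the explicit form of f and the interval representation are assumptions made here, not statements from the text. Under these assumptions the slope peaks at z = 0 and decreases with |z|, so the maximum is attained at the point of the interval nearest to zero.

```python
import math


def sigmoid(z, gamma=1.0):
    """Assumed activation: f(z) = 1 / (1 + exp(-gamma * z))."""
    return 1.0 / (1.0 + math.exp(-gamma * z))


def max_sigmoid_slope(z_low, z_high, gamma=1.0):
    """Maximum of gamma * f * (1 - f) for net input z in [z_low, z_high]."""
    # Clamp 0 into the interval: the slope is largest closest to z = 0.
    z_star = min(max(0.0, z_low), z_high)
    f = sigmoid(z_star, gamma)
    return gamma * f * (1.0 - f)


if __name__ == "__main__":
    print(max_sigmoid_slope(-10.0, 10.0))  # interval contains zero
    print(max_sigmoid_slope(2.0, 5.0))     # interval away from zero
```

When the interval excludes zero the bound is strictly smaller than the global value $\gamma / 4$, which is what makes Lipschitz constants over small weight subsets tighter than the global one.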

The function g can be rewritten as $g_o(f^h)$, where $g_o : R^h \to R$ transfers the hidden layer output
to the output layer input. $g_o$ can be written as

$$g_o = W_h f^h + w_0$$

where $W_h$ is the set of weights between the hidden layer and the output layer, and $w_0$ is the output
unit bias. $f^h$ maps the output from the input layer to the input of the hidden layer.

The components of $f^h$ are given by

$$f_j^h = f_j\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big), \quad j = 1, 2, \ldots, h.$$

Applying Lemma 2.4, we have

$$L_g = L_{g_o} \sum_{j=1}^{h} L_{f_j^h}$$

where $L_{g_o}$ is given by

$$L_{g_o} = \max \|\nabla g_o\| = \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \qquad (2)$$

Note that $f_j^h$ in the hidden layer is equivalent to the output function of an FNN without a hidden
layer. We have

$$L_{f_j^h} = \max \gamma f_j (1 - f_j) \Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}$$

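Putting the above together, we have, for a single hidden layer FNN,

$$L_o = \max \gamma f(1 - f) \, \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j (1 - f_j) \Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2} \qquad (3)$$

and, with $F = \sum_p F_p = \frac{1}{2} \sum_p (t_p - o_p)^2$,

$$L_{F_p} = \max |t_p - o_p| \, L_{o_p}, \quad \forall w \in W, \qquad (4)$$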

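where $L_{o_p}$ is given by Equation 3 with the input $x_p$. We observe that f and the $f_j$ are functions of
the weights and the maximization is taken over the whole weight space, although, with the layered
structure, the $f_j$ depend only on hidden layer weights. Recall that $L_F = \sum_p L_{F_p}$, thus

$$L_F = \sum_{p=1}^{P} L_{F_p} = \sum_{p=1}^{P} \max |t_p - o_p| \, \max \gamma f(1 - f) \, \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j (1 - f_j) \Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2} \qquad (5)$$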

Hence,  we  have  developed  a  procedure  for  estimating  the  Lipschitz  constant  for  FNNs  with  a  single 
output  unit  and  a  single  hidden  layer. 

Equation 3 can be used in estimating the Lipschitz constant for a general three layer FNN
(which is the most widely used NN structure). Let k be the index for the output processing units;
then for each output unit $o_k$, we have

$$L_{o_k} = \max \gamma f_k (1 - f_k) \, \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j (1 - f_j) \Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2} \qquad (6)$$

Consider the criterion function

$$F = \sum_{p=1}^{P} F_p = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} (t_{pk} - o_{pk})^2.$$

For each training pattern p, $F_p = f_e(f_o)$, where $f_e : R^K \to R$ maps the network output to a
performance measure, and $f_o : R^h \to R^K$ maps the hidden layer output to the input to the output
layer. Observe that each component of $f_o$ is equivalent to the output function of a three layer FNN
with a single output, the case discussed in the above subsection. Let $o_k$, $k = 1, 2, \ldots, K$, denote the
component functions of $f_o$; each $o_k$ is Lipschitzian with Lipschitz constant given by Equation 6. By Lemma 2.4,
the Lipschitz constant for $F_p$ is

$$L_{F_p} = L_{f_e} \sum_{k=1}^{K} L_{o_k}$$

where $L_{f_e}$ is given by

$$L_{f_e} = \max \|\nabla_o F_p\| = \max \Big(\sum_{k=1}^{K} (t_{pk} - o_{pk})^2\Big)^{1/2}. \qquad (7)$$

Thus for the criterion function F, we have a Lipschitz constant (using Lemma 2.1 again)

$$L_F = \sum_{p=1}^{P} \max \Big(\sum_{k=1}^{K} (t_{pk} - o_{pk})^2\Big)^{1/2} \sum_{k=1}^{K} L_{o_k} \qquad (8)$$

This  leads  to  the  following  proposition. 

Proposition 2.1 The criterion function representing a three layered feedforward neural network
is Lipschitzian with a Lipschitz constant given by Equation 8.

Extension  of  the  procedure  to  estimating  Lipschitz  constant  for  an  FNN  with  more  than  one 
hidden  layers  can  be  carried  out  by  applying  the  basic  lemmas  recursively,  as  illustrated  above. 


3  Compute  Local  Lipschitz  Constants 

The procedure outlined above allows us to compute Lipschitz constants over subsets of the weight
space. Furthermore, the estimation of the Lipschitz constant is computationally efficient. For clarity
of exposition, we will consider computing the Lipschitz constant of a three-layer FNN with a single
output unit. Using Equation 3, we can compute the Lipschitz constant of the criterion function
with a given training pattern p by

h  h  n 

Lo  =  max  7/(1  -  /)  max  (1  +  X)  /j  )*  Y  ~  + 

j=i  j=i  .=1 

L_{F_p} = \max |t_p - O_p| \, L_o, \quad \forall w \in W.

Four maximization problems need to be solved over a given weight subset. Solving those problems
may seem difficult, as the functions are nonlinear and nonconvex. However, by exploiting
the properties of the sigmoid activation function and the special structure of the FNN, we can
effectively solve those problems over a weight subset when the weight subset is a hyper-rectangle in
the weight space. Details of solving those maximization problems are given in (Tang and Koehler
1993b).

Table 1: Lipschitz Constant over Weight Subsets

Weight Subset  Hyper-rectangle vertices                     Lipschitz Constant

W0             LV=(-10 -10 -10 -10 -10 -10 -10 -10 -10)     1.20388
               UV=( 10  10  10  10  10  10  10  10  10)

W1             LV=(  0   0   0   0   0   0 -10 -10 -10)     0.89769
               UV=( 10  10  10  10  10  10  10  10  10)

W2             LV=(  5   0   0   0   0   0   0   0   0)     0.01584
               UV=( 10  10  10  10  10  10  10  10  10)

W3             LV=(  5   0   0   5   0   0   0   0   0)     0.00793
               UV=( 10   5   5  10  10  10  10  10  10)

W4             LV=(  0   5   5   5   5   0   5   5   0)     0.00792
               UV=(  5  10  10  10  10   5  10  10   5)

W5             LV=(  0   5   0   0   0   0   0   0   0)
               UV=(  5  10   5  10  10  10  10  10  10)

W6             LV=(  0   5   0   5   0   0   0   0   0)     0.01167
               UV=(  5  10   5  10   5  10  10  10  10)

W7             LV=(  0   0   0   0   0   0  -5  -5  -5)     0.89769
               UV=(  5   5   5   5   5   5   5   5   5)

W8             LV=(2.5 2.5   0   0   0   0   0   0   0)     0.05438
               UV=(  5   5   5   5   5   5   5   5   5)

W9             LV=(2.5 2.5 2.5 2.5   0   0   0   0   0)     0.00880
               UV=(  5   5   5   5   5   5   5   5   5)

W10            LV=( -5  -5  -5  -5  -5  -5  -5  -5  -5)     0.74146
               UV=(  0   0   0   0   0   0   0   0   0)

Let us apply the above procedure to estimating the Lipschitz constant of the 2x2x1 XOR
network. Assuming \gamma = 1 and applying Equation 5, a theoretical global Lipschitz constant can be
computed as

L_F = \frac{2}{16} \sqrt{1+h} \, (1 + \sqrt{2} + \sqrt{2} + \sqrt{3})
    = \frac{\sqrt{3}}{8} (1 + 2\sqrt{2} + \sqrt{3})
    = 1.20388.    (9)


This is obtained by overestimating: assuming the weight set is essentially unbounded, we take
|t_p - O_p| = 1, f(1 - f) = 1/4, and (1 + \sum_{j=1}^{h} f_j^2)^{1/2} = \sqrt{1+h}. By actually
maximizing those terms over a given weight subset, we get much smaller local Lipschitz constants
than the global one for each partition element.
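
As a quick arithmetic check of Equation 9, the following Python sketch evaluates the closed form for the XOR network. It assumes h = 2 hidden units, a prefactor of 1/8, and reads the terms 1, sqrt(2), sqrt(2), sqrt(3) as one factor sqrt(1 + sum_i x_i^2) per XOR input pattern in 0/1 encoding. That reading and the name `global_xor_lipschitz` are assumptions made for illustration, not statements from the original derivation.

```python
import math

# The four XOR input patterns (0/1 encoding is an assumption of this sketch).
XOR_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def global_xor_lipschitz():
    """Evaluate (sqrt(3) / 8) * (1 + 2*sqrt(2) + sqrt(3)) term by term."""
    # One factor per pattern: sqrt(1 + sum_i x_i^2) gives 1, sqrt(2), sqrt(2), sqrt(3).
    pattern_terms = [
        math.sqrt(1 + sum(x * x for x in pattern)) for pattern in XOR_INPUTS
    ]
    # sqrt(1 + h) with h = 2 hidden units, and the assumed prefactor 1/8.
    return math.sqrt(1 + 2) / 8 * sum(pattern_terms)


print(round(global_xor_lipschitz(), 5))
```

The result agrees with the value 1.20388 reported for the global constant and for the subset W0 in Table 1.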

Table 1 shows that the local Lipschitz constants vary significantly over different weight
subregions. These subregions are hyper-rectangles identified by the lower vertex (LV) and upper vertex
(UV). With W_0 = {w \in R^9 | -10 \le w_i \le 10, i = 1, 2, ..., 9}, the local Lipschitz constant is
approximately equal to the global Lipschitz constant. However, for some still relatively large weight
subsets (e.g., W_3 and W_4) the Lipschitz constants are quite small. Those local Lipschitz constants
may be used to estimate lower bounds on the global criterion function. They may also be used in
identifying subregions in the weight space that do not contain promising global optimal solutions.
Hence the search space can be reduced.
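
The use of a local Lipschitz constant for bounding and pruning can be sketched as follows. This is a minimal illustration, assuming the constant is taken with respect to the Euclidean norm and that the criterion function has been evaluated at the center of the hyper-rectangle. The function names and the values `f_center` and `best_known` are hypothetical and do not come from the paper.

```python
import math


def lipschitz_lower_bound(f_center, lipschitz, lower, upper):
    """Lower bound on F over the box [lower, upper] from one evaluation.

    For every w in the box, |F(w) - F(c)| <= L * ||w - c||, and ||w - c||
    is at most half the box diagonal when c is the box center.
    """
    half_diagonal = 0.5 * math.sqrt(sum((u - l) ** 2 for l, u in zip(lower, upper)))
    return f_center - lipschitz * half_diagonal


def can_prune(f_center, lipschitz, lower, upper, best_known):
    """A box cannot hold a better solution if its lower bound exceeds best_known."""
    return lipschitz_lower_bound(f_center, lipschitz, lower, upper) > best_known


# Hypothetical numbers: a 9-dimensional box of side 10 with L = 0.01.
lower_vertex = [0.0] * 9
upper_vertex = [10.0] * 9
print(lipschitz_lower_bound(0.5, 0.01, lower_vertex, upper_vertex))
print(can_prune(0.5, 0.01, lower_vertex, upper_vertex, best_known=0.3))
```

A small local constant makes the bound tight even for a large box, which is why subsets such as W_3 and W_4 are attractive candidates for elimination.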

4  Conclusions 

We  studied  the  Lipschitz  properties  of  the  feedforward  neural  networks.  We  have  shown  that  the 
sum-of-squared  error  criterion  function  of  a  feedforward  neural  network  is  Lipschitzian.  The  special 
structure  of  feedforward  neural  networks  makes  it  possible  to  estimate  the  Lipschitz  constant  in 
a  recursive  procedure.  Furthermore,  by  exploiting  the  structure  of  the  network  and  the  property 
of  the  sigmoid  activation  function  the  computation  of  local  Lipschitz  constant  can  be  efficiently 
carried  out. 

Local Lipschitz constants can be used either to compute lower bounds on the optimal solution,
or to describe approximately the topology of weight subsets. It is well known that the error surface
of a feedforward neural network is composed mainly of flat plateaus and some deep valleys that
contain local or global minimum solutions. Our procedure provides a way to identify those flat
areas and may be used to reduce the search space in neural network training.

Some Remarks about Boundary Creation by
Multi-Layer Perceptrons


Konrad Weigl and Marc Berthod*

Abstract. In the present paper, we challenge the classical concept that
hidden-layer neurons necessarily form individual segments of boundaries
between classes when used for classification: in order to fulfill such a
purpose, the majority of the output values of the neurons must be close
to 0 or 1 for the input samples given. We introduced in [7] [8] [9] a
new learning algorithm, Projection Learning: given enough neurons, the
weights from input to hidden layer computed by that algorithm are so
small that nowhere does any output from the hidden layer get into the
proximity of the upper or lower bound. The function approximation and
classification are thus not formed by individual neurons forming boundary
segments, but by the linear or non-linear superposition of the outputs
of the neurons of the hidden layer. We briefly introduce the new
algorithm, compare it to classical algorithms such as backpropagation, show
the relevant statistics of the output values of the hidden-layer neurons in
a real-world example, and conclude upon the relevance of the findings.
Keywords: Tensor Theory, Projection Operators, Metric Tensors, Radial
Basis Functions, Multi-layer perceptrons


1  The  general  approach 

In the present paper we shall concentrate on the mathematical aspects. Refer
to the papers above and [10] for the paradigm. The aim of approximation is
modelled by the aim to minimize a function E = \sum_{k=1}^{M} (F(x_k) - A(x_k))^2,
F(x_k) being the function to be approximated, A(x_k) the approximating function,
x_k the set of input values, k \in {1, .., M}, g_i, i \in {1, .., N}, the set of arbitrary
differentiable functions computed by the N hidden-layer neurons, and g_i(x_k), k \in
{1, .., M}, i \in {1, .., N}, the output values computed by the hidden-layer neurons
for given inputs x_k. We shall assume a linear output neuron¹. The problem
belongs to the class of separable non-linear least squares [3] [4] [5]. The difference
to backpropagation is that we compute the weights from hidden layer to
output layer directly at each step for given input- to hidden-layer weights; we have
proven that this approach is exact (cf. [10] for details) and shown that it is
fast, see below.

* email: weigl@sophia.inria.fr berthod@sophia.inria.fr
Correspondence to: Konrad Weigl

¹ We have shown in [10] the extension to a non-linear output neuron with invertible
activation function; extension to multiple output neurons is trivial.

2 The algorithm step by step


After initialization, the steps in the learning loop, thus per iteration, are
the following:

1) Compute the output values g_i(x_k) of all the filters/nodes/hidden-layer
neurons i for all the input samples x_k, k \in 1, .., M;

2) Multiply pairwise g_i(x_k) with g_j(x_k), and sum over all these products
g_i(x_k) g_j(x_k), k \in 1, .., M; this gives us the scalar g_ij. Do this for all filters i,
j. These scalars g_ij are the components of the covariant metric tensor, which is
a symmetric and real matrix. Invert the matrix by any method, e.g. neural, of
your choice [10]. This gives us the contravariant metric tensor.

3) Multiply all the output samples given F(x_k), k \in 1, .., M with the
corresponding filter outputs g_i(x_k), k \in 1, .., M computed above. Sum over all these
products F(x_k) g_i(x_k), k \in 1, .., M; this gives us the covariant component A_i.
Repeat for all the filters.

4) Multiply the contravariant metric tensor obtained above, which is a matrix,
with the vector formed by all the covariant components A_i; this gives us the
vector of contravariant components A^i. Multiplying now all the filters
with the corresponding A^i, and summing up over all the indices i, gives us the
function approximation for the network and input sample x_k, called A(x_k):


A(x_k) = \sum_{i} A^i g_i(x_k)    (1)

Thus we can now compute the distance E:

E = \sum_{k=1}^{M} (F(x_k) - A(x_k))^2    (2)
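
Steps 1) to 4) can be sketched in dependency-free Python as below. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the inputs are scalars for brevity, and the metric tensor is not inverted explicitly but the equivalent linear system is solved by Gauss-Jordan elimination.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("metric tensor is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def projection_step(filters, xs, targets):
    """Return (contravariant components, approximation, distance E)."""
    n = len(filters)
    # Step 1: filter outputs g_i(x_k) for every filter i and sample k.
    g = [[f(x) for x in xs] for f in filters]
    # Step 2: covariant metric tensor g_ij = sum_k g_i(x_k) g_j(x_k).
    metric = [[sum(a * b for a, b in zip(g[i], g[j])) for j in range(n)]
              for i in range(n)]
    # Step 3: covariant components A_i = sum_k F(x_k) g_i(x_k).
    covariant = [sum(t * v for t, v in zip(targets, g[i])) for i in range(n)]
    # Step 4: contravariant components A^i, i.e. metric^{-1} applied to A_i.
    contravariant = solve_linear(metric, covariant)
    approx = [sum(contravariant[i] * g[i][k] for i in range(n))
              for k in range(len(xs))]
    # Equation (2): squared distance between target and approximation.
    distance = sum((t - a) ** 2 for t, a in zip(targets, approx))
    return contravariant, approx, distance


# Toy example: the filters 1 and x reproduce F(x) = 2x + 1 exactly.
coeffs, approx, distance = projection_step(
    [lambda x: 1.0, lambda x: float(x)], [0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
print(coeffs, distance)
```

Solving the system instead of forming the inverse is a numerical convenience; for a handful of hidden units either choice gives the same contravariant components up to rounding.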

5) We have to differentiate E with regard to the parameters/weights of the
nodes/basis functions; thus we need, by the chain rule, for example, to
differentiate w.r.t. the parameters of the filter j, called params_j:

E = \sum_{k=1}^{M} \Big( F(x_k) - \sum_{i} A^i g_i(x_k) \Big)^2    (3)

implies

\frac{\partial E}{\partial params_j} = -\sum_{k=1}^{M} 2 \Big( F(x_k) - \sum_{i} A^i g_i(x_k) \Big) A^j \frac{\partial g_j(x_k)}{\partial params_j} .    (4)

\partial g_j(x_k) / \partial params_j depends on the type of basis function used, obviously; for a sigmoid,
it is for example [6]: \partial g_j(x_k) / \partial params_j = g_j(x_k)(1 - g_j(x_k)) x_k for the weight of a
non-bias input, and \partial g_j(x_k) / \partial params_j = g_j(x_k)(1 - g_j(x_k)) to compute the bias weight.

² The superscripts are not power exponents!

³ We have shown in [10] that the terms depending upon \partial A^i / \partial params_j cancel out.

6) Once we have computed \partial E / \partial params_j for all parameters, we can compute:

\frac{d\, params_j}{dt} = -\frac{\partial E}{\partial params_j}    (5)

and compute one step via gradient descent or other; then we reiterate the
loop with steps 1) - 6), until the minimal distance has been found. Figure 1
shows the evolution of the training of the network on the XOR function, while
Figure 2 compares the computation time of our algorithm with standard
gradient descent to that of backpropagation with standard gradient descent.
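
The derivative used in steps 5) and 6) can be illustrated for sigmoid filters with the short Python sketch below. It assumes scalar inputs, filters of the form g_j(x) = sigmoid(w_j x + b_j), and coefficients A^i held fixed while differentiating (the source states that the terms involving their derivatives cancel). All names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def distance(weights, biases, coeffs, xs, targets):
    """E = sum_k (F(x_k) - sum_i A^i g_i(x_k))^2 for sigmoid filters."""
    total = 0.0
    for x, t in zip(xs, targets):
        approx = sum(a * sigmoid(w * x + b)
                     for a, w, b in zip(coeffs, weights, biases))
        total += (t - approx) ** 2
    return total


def grad_input_weight(j, weights, biases, coeffs, xs, targets):
    """dE/dw_j with the coefficients held fixed."""
    grad = 0.0
    for x, t in zip(xs, targets):
        outs = [sigmoid(w * x + b) for w, b in zip(weights, biases)]
        residual = t - sum(a * g for a, g in zip(coeffs, outs))
        # Sigmoid derivative for a non-bias input: g_j (1 - g_j) x.
        grad += -2.0 * residual * coeffs[j] * outs[j] * (1.0 - outs[j]) * x
    return grad


def gradient_step(weights, biases, coeffs, xs, targets, rate=0.1):
    """One plain gradient-descent step on the input weights only."""
    grads = [grad_input_weight(j, weights, biases, coeffs, xs, targets)
             for j in range(len(weights))]
    return [w - rate * g for w, g in zip(weights, grads)]
```

A finite-difference comparison of `grad_input_weight` against `distance` is an easy way to confirm the sign and the factor of 2 in the derivative.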


3  Measurements  on  taught  network 


For an individual neuron with sigmoidal activation function to operate as a
segment of a boundary between classes, its output must be close to 0 or 1 for a
major part of the input samples.

We have thus applied the algorithm described to a real-world classification task
with two classes, namely the detection of inhabited areas on satellite images
[12], using from three to thirty hidden-layer neurons and a linear output neuron.
Computation time to convergence was between 46 and 149 seconds on a Sparc
10.

The networks were then tasked to classify the test set, which consisted of a
384x384-pixel image, i.e. roughly 147,000 samples. We then made individual
histograms of the outputs of all the individual hidden-layer neurons for the
147,000 samples given, one histogram per network with a given number of
neurons. Figure 3 shows the results: from top left to bottom right, the number of
neurons and the number of learning iterations are increasing. In hindsight, this is
obvious: the more neurons, the greater the chance that the manifold which they span
in function space is close to the function to be approximated, and thus the greater
the chance that the error/distance is small. We can see that therefore, with
a small number of neurons, the neurons have to adapt, and effectively form
boundaries, as we can see from the distribution peak at zero for a small number
of neurons; when the number of neurons increases, however, their output values
tend to gather around the original random initialization value. Thus in no way
do individual neurons form boundaries, and only the linear superposition of the
outputs allows for the approximation of the function. This means that the
neurons with randomly chosen parameters are able to span such a manifold, and
thus represent such a function, if there are enough of them. This is a result akin
to the coarse-coding paradigm of Rumelhart [6], the population coding concept
of Gaal [2], or the frame concept of Daubechies [1] etc., though we are in no
way projecting here to a higher dimensional space, as these authors are doing.

4 Conclusion


The  present  study  would  have  been  impossible  with  backpropagation,  which 
would  have  modified  the  parameters  of  the  hidden-layer  neurons  before  any 
sensible  statistical  analysis  could  have  taken  place;  only  by  using  an  approach 
akin  to  ours,  which  computes  the  optimal  hidden-  to  output-layer  weights  for 
given  input-  to  hidden  layer  weights,  are  we  able  to  determine  how  many  neurons 
are  indeed  necessary  to  form  a  boundary: 

If the output values tend to stay close to the initial values, the number of
hidden-layer neurons is higher than necessary; only if the distribution of output values
converges towards the upper and/or lower limit do these neurons function as
class limiters. In all other cases, a presumably highly redundant superposition
of the outputs of still randomly-distributed hidden-layer neurons fulfills the
approximation task.

Fig. 1: Evolution of the system; two sigmoidal filters, four samples; top left
image shows the original XOR function; the remaining images, top left
linewise to bottom right, show the evolution of the system. Data: 188
iterations, 760 msecs on Sparc 10

Fig. 2: Time to convergence, averaged over 20 runs each: Projection
learning on left, backpropagation on right. Both use the same initial
random weights; convergence time in secs

Fig. 3: From top left to bottom right, increasing number of neurons of the hidden
layer; computation time varies between 46 and 149 seconds for a final convergence to
between 1% and 3% max. error

ABSTRACT

This paper analyses the performance of the neural network loop and
gives the relationship between the number of stored patterns M and the
probability p of an input element. The analysis shows that if the number of
neural nodes N is sufficiently larger than the number of stored patterns M,
then with high probability the neural network loop converges to the stored
pattern vectors.

1. Introduction

The classification of stationary random signals and associated
signatures may be performed using neural network techniques. The
bidirectional associative memory (BAM) [K87] and the Hopfield network [L87] can do
this work. The paper [Z91] gives an architecture of a neural network loop (NNL)
which performs associative memory.

The basic system that we shall discuss is shown in Fig. 1. We assume one
layer of simple model neurons projects to another layer. Suppose we have
three sets of N neurons, called X, Y and Z, where every neuron in set X
projects to every neuron in set Y. A neuron j in set X is connected to
every neuron i in set Y by way of a modifiable synapse with strength
w_xy(i, j), forming an NxN connectivity matrix W_xy. Similarly, we can
obtain the connectivity matrices W_yz and W_zx. Fig. 2 shows the architecture
of layer X. It receives the input signals from layer Z or external
input and projects the outputs to layer Y. The nodes sum the weighted
inputs from the preceding layer or the external input, and then pass the result
through a hard limiting function. Using the following learning rule, we can
obtain the weight matrices W_xy, W_yz and W_zx:

Fig. 2 Layer X Architecture

W_{xy} = \sum_{m=1}^{M} Y(m) X(m)^T    (1)

W_{yz} = \sum_{m=1}^{M} Z(m) Y(m)^T    (2)

W_{zx} = \sum_{m=1}^{M} X(m) Z(m)^T    (3)

where X(m) = (x_1(m), x_2(m), ..., x_N(m))^T
      Y(m) = (y_1(m), y_2(m), ..., y_N(m))^T
      Z(m) = (z_1(m), z_2(m), ..., z_N(m))^T
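
The outer-product learning rule and one pass around the loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stored patterns are tiny bipolar vectors chosen to be mutually orthogonal so that recall is exact, the hard limiter maps 0 to +1, and the function names are hypothetical.

```python
def hard_limit(value):
    # Hard limiter; mapping 0 to +1 is a convention chosen for this sketch.
    return 1 if value >= 0 else -1


def outer_sum(targets, sources):
    """Return W = sum_m target(m) source(m)^T as a list of rows."""
    weights = [[0] * len(sources[0]) for _ in range(len(targets[0]))]
    for target, source in zip(targets, sources):
        for i, t in enumerate(target):
            for j, s in enumerate(source):
                weights[i][j] += t * s
    return weights


def recall(weights, state):
    """Multiply the weight matrix by a state vector and hard-limit the result."""
    return [hard_limit(sum(w * s for w, s in zip(row, state))) for row in weights]


# Two stored pattern triples (M = 2, N = 4), orthogonal within each layer.
X = [[1, 1, 1, 1], [1, -1, 1, -1]]
Y = [[1, 1, -1, -1], [1, -1, -1, 1]]
Z = [[1, -1, 1, -1], [-1, -1, 1, 1]]

W_xy = outer_sum(Y, X)  # maps layer X to layer Y
W_yz = outer_sum(Z, Y)  # maps layer Y to layer Z
W_zx = outer_sum(X, Z)  # maps layer Z back to layer X

y = recall(W_xy, X[0])
z = recall(W_yz, y)
x = recall(W_zx, z)
print(y, z, x)
```

With orthogonal patterns the cross-talk between stored vectors vanishes, so one trip around the loop returns the starting pattern; with random patterns the cross-talk is nonzero and recall is only probabilistic.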


2. Performance

From an input vector, such as X(m), we can recall the other two
corresponding samples Y(m) and Z(m) of the patterns stored in the NNL [Z91]. This is
especially useful when the input is a partial vector: the other two vector
samples can still be recalled correctly. This shows that the NNL has fault
tolerance and robustness against noise.

It can be proved easily that the NNL performs the tasks accomplished by the
Hopfield and BAM neural networks [Z91]. So the NNL has properties similar to
those of the Hopfield network and the BAM [Z91]. The paper [Z91] has proved that
when q_0 = p_0 = 1/2, the NNL can converge with high probability, where

q_0 = P\{x_i^{(m)} = 1\} = P\{y_i^{(m)} = 1\} = P\{z_i^{(m)} = 1\}    (4)
p_0 = P\{x_i^{(m)} = -1\} = P\{y_i^{(m)} = -1\} = P\{z_i^{(m)} = -1\}    (5)

are the probability distributions of x_i(m), y_i(m) and z_i(m).

Now we analyse the convergence of the NNL when q_0 \ne p_0 \ne 1/2. If the NNL is
addressed by multiplying the matrix W_xy (or W_yz, W_zx) with one of the state
vectors, say X(m_0) (or Y(m_0), Z(m_0)), it yields the estimate

u_i = \sum_{j=1}^{N} w_{xy}(i, j) x_j^{(m_0)}
    = \sum_{j=1}^{N} \sum_{m=1}^{M} y_i^{(m)} x_j^{(m)} x_j^{(m_0)}
    = N y_i^{(m_0)} + s,    (6)

where s = \sum_{m \ne m_0} \sum_{j=1}^{N} x_j^{(m)} x_j^{(m_0)} y_i^{(m)}.

u_i consists of the sum of two terms: the first is the corresponding output
vector amplified by N; the second is a linear combination of the
remaining stored vectors and represents an unwanted cross-talk term.
The value of s is a sum of terms that add randomly. In order to recall the
corresponding stored vectors correctly, we want the absolute value of the
first term to be larger than that of the second one. First, let us discuss
each term of the second term in Eq. (6). Let

s_j^{(m)} = x_j^{(m)} x_j^{(m_0)} y_i^{(m)}.    (7)

Suppose the probability distribution of s_j^{(m)} is

P\{s_j^{(m)} = 1\} = p
P\{s_j^{(m)} = -1\} = q

where p + q = 1. For all j = 1, 2, ..., N and m = 1, 2, ..., M with m \ne m_0, the
probability that n of the s_j^{(m)} are equal to 1 is

P\{s = 2n - N(M-1)\} = \binom{N(M-1)}{n} p^n q^{N(M-1)-n},

where \binom{N(M-1)}{n} = [N(M-1)]! / (n! [N(M-1) - n]!).

Suppose s is a random variable with mean value m_s and variance \sigma_s^2. We can
obtain

m_s = N(M-1)(p - q)    (8a)

\sigma_s^2 = 4N(M-1)pq    (8b)

The magnitude of s can be represented as

\sqrt{E\{s^2\}} = \sqrt{4N(M-1)pq + [N(M-1)(p - q)]^2}

Now we shall discuss E{s^2} versus p and rewrite E{s^2} as

G(p) = E\{s^2\} = 4N(M-1)p(1-p) + N^2(M-1)^2(2p-1)^2    (9)

Setting the derivative to zero,

G'(p) = 4N(M-1)(1-2p) + 4N^2(M-1)^2(2p-1) = 0,

we obtain p = 1/2. Substituting p = 1/2 into Eq. (9), we have G(1/2) =
N(M-1). So when p = q = 1/2, G(p) reaches its minimum value N(M-1), and in this
condition the network has the highest capacity.

Now we discuss E{s^2} versus p_0. From Eqs. (6)-(7) we have

p = 2 p_0 q_0^2 + p_0 (q_0^2 + p_0^2)    (10)

q = q_0 (q_0^2 + p_0^2) + 2 p_0^2 q_0    (11)

E{s^2} versus p_0 is determined by Eqs. (9)-(11).
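
The dependence of the cross-talk power on p and p_0 can be explored numerically with the Python sketch below. It assumes the readings G(p) = 4N(M-1)p(1-p) + N^2(M-1)^2(2p-1)^2 for Eq. (9) and p = 2 p_0 q_0^2 + p_0 (q_0^2 + p_0^2) for Eq. (10), with q_0 = 1 - p_0. The function names and the example values of N and M are hypothetical.

```python
import math


def sign_probability(p0):
    """Eq. (10): probability p of a cross-talk term, given element probability p0."""
    q0 = 1.0 - p0
    return 2 * p0 * q0 ** 2 + p0 * (q0 ** 2 + p0 ** 2)


def mean_square_cross_talk(n_nodes, n_patterns, p):
    """Eq. (9): G(p) = E{s^2} for N nodes and M stored patterns."""
    k = n_nodes * (n_patterns - 1)
    return 4 * k * p * (1 - p) + k ** 2 * (2 * p - 1) ** 2


N, M = 10, 3
for p0 in (0.3, 0.4, 0.5):
    p = sign_probability(p0)
    rms = math.sqrt(mean_square_cross_talk(N, M, p))
    print(f"p0={p0:.1f}  p={p:.3f}  sqrt(G)={rms:.2f}  N/2={N / 2}")
```

Because Eq. (10) maps p_0 = 0.4 to p = 0.496, a wide interval of p_0 around 1/2 lands in a narrow interval of p around 1/2, where G(p) stays close to its minimum N(M-1).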

Figs. 3-4 give curves of \sqrt{G(p)} versus p and p_0 respectively. From the
figures we can see that although \sqrt{G(p)} is smaller than N/2 only in a small
domain of p, it is smaller than N/2 in a broad domain of p_0. This is
what we expect. In this domain the network can converge to the correct
pattern vectors, because Equations (10)-(11) convert the narrow domain of
p into the broad domain of p_0. We can see this point from Figure 5.

Fig. 3 \sqrt{G(p)} versus p        Fig. 4 \sqrt{G(p)} versus p_0

Equation (9) determines the relationship between the stored pattern number M
and p. When \sqrt{G(p)} becomes large with p, the second term of Equation (9)
is much larger than the first one, and Equation (9) becomes

G(p) \approx N^2(M-1)^2(2p-1)^2

The condition that \sqrt{G(p)} is smaller than N/2 becomes

|p - q| < 1/[2(M-1)]    (12)

Figure 3 also shows this relationship.

Fig. 5 p versus p_0


3.  Conclusion 

This paper analyses the performance of the neural network loop and concludes
that although the network converges to the correct patterns only in a narrow
domain of p, it converges to the correct patterns in a broad domain of p_0.
This result guarantees the convergence of the network.

Abstract: This paper introduces a neural unit, similar to a sigma-pi unit, that can learn and
generalize  linearly  inseparable  binary  input  vectors.  Learning  effectively  decides  a  higher-order 
polynomial  suitable  to  the  problem  being  trained.  The  unit  generalizes  in  accordance  with  the  rela¬ 
tion  specified  by  that  polynomial,  and  hard  problems  like  the  parity  problem  are  generalized  easily. 
In  training  the  neural  unit,  a  gradient-descent  based  supervised  learning  algorithm  is  adopted. 

1.  INTRODUCTION 

The  threshold  mechanism  in  a  McCulloch  and  Pitts  neuron  is  not  the  only  nonlinearity  that  plays 
an  important  role  in  information  processing  in  the  brain.  Over  the  years,  a  substantial  body  of  ev¬ 
idence  has  grown  to  support  the  presence  of  nonlinear  synaptic  connections  and  multiplicative-like 
operations.  We  refer  the  reader  to  the  review  article  by  Koch  and  Poggio  [1].  The  introduction  of 
polynomial or sigma-pi units [2][3] has motivated research into the computational
abilities  of  neural  units  with  nonlinear  synaptic  connections.  The  output  of  a  polynomial  unit  is  a 
function  of  the  linear  sum  of  some  monomials,  where  each  monomial  is  the  product  of  some  number 
of  inputs  Xi  and  a  weight  parameter  e.g.  wxlxsx^.  It  has  been  argued  that  networks  based  on 
sigma-pi  units  may  be  more  powerful  and  have  other  advantages  with  respect  to  the  more  traditional 
threshold-based  networks  [4][5].  The  backpropagation  algorithm  is  commonly  adopted  to  train  the 
polynomial networks [2][6], and usually only the terms up to second order are used and higher-order terms
are  ignored.  However,  for  many  problems  like  the  parity  problem,  higher-order  terms  play  the  most 
decisive  role  and  cannot  be  ignored.  Even  though  invariance  properties  may  be  used  for  certain 
problems,  in  general,  learning  algorithms  do  not  specify  which  of  the  higher-order  monomials  are 
the  most  relevant  ones  and,  therefore,  to  be  taken  into  consideration  for  the  problem  in  hand. 

In an attempt to obtain a neural unit that can do nonlinear separation, another approach known
as  the  Gaussian  potential  function  network  (GPFN)  has  been  proposed  [7].  GPFN  is  capable  of 
performing  forward  mappings  as  a  pattern  classifier  and  approximates  an  arbitrary  many-to-one 
continuous  function  by  a  potential  field  synthesized  over  the  domain  of  the  input  space  by  a  number 
of  Gaussian  computational  units  called  Gaussian  Potential  Function  Units  (GPFU’s).  The  synthesis 
of  a  potential  field  is  accomplished  by  learning  the  location  and  shape  of  individual  GPFU’s,  as  well 
as  determining  the  minimum  necessary  number  of  GPFU’s  via  a  gradient-descent  based  supervised 
learning  algorithm. 

The reason for much of the excitement about neural networks is their ability to generalize to new
situations, and a neural network that is efficient in learning is not necessarily good at generalizing.
After being trained on a number of examples of a relationship, a network is expected to interpolate
and extrapolate from the examples in a sensible way. But what is meant by sensible generalization is
often not clear. How does a neural network, or a human for that matter, choose the 'right' one among
almost infinitely many possible generalizations? As an example, one could train a neural network
with six or seven of the eight parity relations in three dimensions, and it would be very unlikely
that any of the known types of networks would actually generalize to full parity. A child with a
reasonable IQ will, however, discover the relation as multiplication, or as evenness or oddness, and
generalize correctly. Perhaps discovering a relation through learning, and generalizing accordingly,
is the sensible generalization, and that is what the neural unit we introduce does for binary inputs.

2.  THE  MODEL  NEURAL  UNIT 

The infinite polynomial sum representing the postsynaptic polarization potential of a sigma-pi unit
[2] can be written as a finite sum in the case of binary inputs as follows:

\phi(X) = w_0 + \sum_{k=1}^{N} \sum_{j_1=1}^{N-k+1} \sum_{j_2=j_1+1}^{N-k+2} \cdots \sum_{j_k=j_{k-1}+1}^{N} w_{j_1 j_2 \ldots j_k} x_{j_1} x_{j_2} \cdots x_{j_k}    (1)

Here, N is the dimension of the input vectors X = (x_1, x_2, ..., x_N), and the w's are the weight
parameters. The value of x_i (i = 1, 2, ..., N) is either +1 or -1. A term like x_1^2 x_2 x_7 in the
infinite polynomial has been absorbed into the term x_2 x_7 since each x_i is binary, and that is
what made it possible to obtain a finite polynomial. For example, for N = 2, Eq. (1) yields
\phi(X) = w_0 + w_1 x_1 + w_2 x_2 + w_{12} x_1 x_2.

The output of the neural unit, S(X), is binary (+1 or -1) and given as:

S(X) = sgn(\phi(X))    (2)
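
The finite polynomial for binary inputs and the sign output can be sketched in Python as follows. The representation of the weights as a dictionary keyed by index tuples, the function names, and the convention sgn(0) = +1 are assumptions made for this illustration only.

```python
from itertools import combinations
from math import prod


def index_sets(n):
    """All 2^n subsets of the input indices; the empty tuple stands for w_0."""
    return [idx for k in range(n + 1) for idx in combinations(range(n), k)]


def phi(x, weights):
    """Finite polynomial: sum over index sets of weight * product of inputs.

    `weights` maps a tuple of 0-based input indices to its weight; index
    sets that are absent are treated as having weight zero.
    """
    return sum(w * prod(x[i] for i in idx) for idx, w in weights.items())


def unit_output(x, weights):
    """S(X) = sgn(phi(X)), with sgn(0) taken as +1 in this sketch."""
    return 1 if phi(x, weights) >= 0 else -1


# For N = 2: phi = w0 + w1*x1 + w2*x2 + w12*x1*x2.
example = {(): 0.5, (0,): 1.0, (1,): -1.0, (0, 1): 2.0}
print(phi((1, -1), example))

# Three-input parity in the +1/-1 encoding needs only the third-order term.
parity = {(0, 1, 2): 1.0}
print(unit_output((1, -1, 1), parity))
```

Storing only the nonzero weights reflects the remark that most of the 2^N terms can be ignored when their weights are small.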

S(X) will not be affected if we ignore some of the weights, and the corresponding terms, with
relatively smaller absolute values in \phi(X), which is important since \phi(X) contains 2^N weight
parameters and naturally, we do not want to compute them all. But how do we detect the most relevant
terms, that is, the terms with relatively larger absolute values?

\phi(X) can be written as a linear sum of some "product terms" as follows:

L 

4>{X)  =  +  Mixi)iHi  +  Mix2) . . .  -b  M^xa^)  (3) 

i=i 

Here, the coefficients of the product terms are to be determined and L is some finite integer. Note that
a product term contains all the terms in Eq. (1) when it is expanded, but some of the weights in
that expansion are dependent on the others. However, the sign of \phi(X) is important, not its exact
value, and this provides some degrees of freedom. If more degrees of freedom are required, then we
add new product terms, that is, we increase L, which releases some of the dependent weights.

In order to compute the coefficients, we define a cost function as:

£  =  (“) 

p 

where  the  sum  is  over  the  training  patterns.  Xp  and  dp  denote  the  training  pattern  p  and 
its  desired  output  respectively.  The  cost  function  is  minimized  using  the  gradient-descent  method 
which,  for  pattern  p  implies: 


AJ2 

£^Mi 


(5)

where k = 1, 2, .... Initially, random values are assigned to the coefficients. \gamma is a parameter
set in the simulations as \gamma = 0.1 if sgn(\phi(X_p)) = d_p, \gamma = 1 otherwise. It has the effect of focusing
on getting the sign of \phi(X) right before paying attention to the magnitude. The learning rate \eta is
not taken as a constant but, instead, is adjusted automatically during the learning process as:

\Delta\eta = +a        if \Delta E < 0 consistently
           = -b\eta    if \Delta E > 0                    (6)
           = 0         otherwise

where a and b are appropriate constants. The meaning of "consistently" is based on the last K
steps. In the simulations, a, b and K are set as a = 0.05, b = 0.3 and K = 10. Such an automatic
adjustment of \eta has been proposed by various authors, e.g. [8][9], for the backpropagation algorithm,
which we have adopted here.
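
The learning-rate adjustment of Eq. (6) can be sketched in Python as below. The sketch interprets "consistently" as "the error decreased in each of the last K steps", which is an assumption; the class name and the initial rate are likewise hypothetical, while a = 0.05, b = 0.3 and K = 10 are the values quoted in the text.

```python
from collections import deque


class AdaptiveRate:
    """Learning rate that grows additively and shrinks multiplicatively."""

    def __init__(self, eta=0.1, a=0.05, b=0.3, window=10):
        self.eta = eta
        self.a = a
        self.b = b
        # Signs of the most recent error changes; at most `window` entries.
        self.history = deque(maxlen=window)

    def update(self, delta_e):
        """Adjust eta from the latest change in error and return it."""
        self.history.append(delta_e)
        if delta_e > 0:
            # Error went up: cut the rate by the fraction b.
            self.eta -= self.b * self.eta
        elif (len(self.history) == self.history.maxlen
              and all(d < 0 for d in self.history)):
            # Error went down in each of the last K steps: add a.
            self.eta += self.a
        # Otherwise leave the rate unchanged.
        return self.eta


rate = AdaptiveRate(eta=0.1, window=3)
for change in (-1.0, -1.0, -1.0, 1.0):
    print(rate.update(change))
```

The asymmetry between the small additive increase and the larger multiplicative decrease makes the schedule cautious: a single increase in error undoes several steps of growth.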

The value of \phi(X) is unbounded, which may lead to "blow-ups" during the learning process. We
have, however, avoided this by constraining the absolute values of the coefficients to be less than 1 during the simulations.

It is important that we add new product terms gradually. That is, we start with one product
term and, if after a certain number of cycles the error E does not fall below a required limit, then
add a few more (one or two, not twenty) product terms and apply the algorithm again. As far as
the learning is concerned, there is no need for the gradual addition of the product terms; in any
case the neural unit will learn. However, if new terms are not added gradually we may get a
different generalization. A trivial example is the XOR problem. If we train three of the four patterns
using two product terms, then, depending on the random initial values of the coefficients in the
algorithm, it may generalize the fourth pattern to +1 or -1. But if one product term is used then
the fourth pattern will be generalized to the full XOR. This is because, with two product terms,
even if E = 0, the degree of freedom is sufficient that there exist two different sets of coefficients
(one corresponding to the full XOR and the other corresponding to the linearly separable solution)
which can accommodate the three patterns that are trained. Hence, gradual addition of the product
terms effectively detects the most relevant terms mentioned earlier, and forces the neural unit to
find a relation as simple as possible.

3.  SIMULATIONS 

In the simulations, after the addition of a new product term, all the coefficients are set to random
values and 1000 cycles of training are applied. If after 1000 cycles the error does not fall below the
required limit, another product term is added and the same steps are repeated. Therefore, a pattern
set that employs L product terms is trained for at most 1000L cycles. There is a trade-off
between the number of product terms and the error limit. We have taken the error limit as 0.05 N_p,
where N_p is the number of training patterns, which is good enough for sensible generalization.
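
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the unit is taken to be a sum of product terms, each a product of factors (a_i + b_i x_i), as in the polynomials printed in the examples below; the error is assumed to be the summed squared error; the coefficients are simply clipped to [-1, 1] as a stand-in for the constraint mentioned earlier; and the names phi, train_cycle and grow_and_train, the learning rate and the seed are all hypothetical.

```python
import random

def phi(terms, x):
    """Sum over product terms of prod_i (a_i + b_i * x_i)."""
    total = 0.0
    for term in terms:
        p = 1.0
        for (a, b), xi in zip(term, x):
            p *= a + b * xi
        total += p
    return total

def clip(v):
    # Assumed form of the constraint: keep every coefficient inside [-1, 1].
    return max(-1.0, min(1.0, v))

def train_cycle(terms, patterns, eta):
    """One pass of online gradient descent on the squared error (assumed error measure)."""
    for x, d in patterns:
        err = d - phi(terms, x)
        for term in terms:
            factors = [a + b * xi for (a, b), xi in zip(term, x)]
            for i, xi in enumerate(x):
                rest = 1.0                      # product of all factors except factor i
                for j, f in enumerate(factors):
                    if j != i:
                        rest *= f
                a, b = term[i]
                term[i] = (clip(a + eta * err * rest), clip(b + eta * err * rest * xi))
    return sum((d - phi(terms, x)) ** 2 for x, d in patterns)

def grow_and_train(patterns, max_terms=3, cycles=1000, eta=0.05, seed=0):
    rng = random.Random(seed)
    n = len(patterns[0][0])
    limit = 0.05 * len(patterns)                # error limit 0.05 * N_p from the text
    for n_terms in range(1, max_terms + 1):
        # After a term is added, ALL coefficients are set to random values again.
        terms = [[(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
                 for _ in range(n_terms)]
        for _ in range(cycles):
            error = train_cycle(terms, patterns, eta)
            if error < limit:
                return terms, error
    return terms, error

# Six 3D parity patterns (x_1, x_2, x_3) -> d_p, as in Table 1.
table1 = [((1, -1, 1), -1), ((-1, 1, -1), 1), ((-1, -1, 1), 1),
          ((-1, 1, 1), -1), ((1, 1, -1), -1), ((1, -1, -1), 1)]
terms, error = grow_and_train(table1)
```

Whether a given run stops at one product term depends on the random initial coefficients, which is why the sketch returns both the terms and the final error rather than assuming success.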

Learning 

2^N distinct vectors in N dimensions with various random desired outputs are taken and the neural
unit is tested up to N = 7. It has learned all of the 2^N input vectors completely. The number of
product terms it employed varied a lot, as expected, depending on the desired outputs. The
maximum numbers of product terms employed in 2, 3, 4, 5, 6 and 7 dimensions are 4, 4, 5, 6, 9 and 17
respectively. These numbers are, in fact, higher than they should be. For example, for N = 2 the
maximum number should be 2. We interpret this as an artifact of the gradient-descent method's
local minima problem. However, note that the storage requirement can be eased by ignoring
coefficients that are close to 0, without affecting S(X).

Generalization 

Example 1: The parity problem deserves special attention, as it is usually considered the most
difficult problem for most of the existing models and usually cannot be generalized. The neural
unit is trained with the 3D pattern set shown in Table 1.

    x_1   x_2   x_3        d_p
     1    -1     1    ->   -1
    -1     1    -1    ->    1
    -1    -1     1    ->    1
    -1     1     1    ->   -1
     1     1    -1    ->   -1
     1    -1    -1    ->    1

Table 1: Training pattern set used for the parity generalization in 3D

    x_1   x_2   x_3   x_4        d_p
    -1    -1    -1     1    ->   -1
    -1     1    -1     1    ->    1
    -1    -1     1     1    ->    1
     1    -1     1    -1    ->    1

Table 2: Training pattern set for the parity generalization in 4D

The  neural  unit  discovered  the  parity  relation  with  one  product  term  as 

    φ(X) = (-0.01 + 0.99 x_1)(0.03 - 0.97 x_2)(0.06 - 0.99 x_3)

         ≈ 0.95 x_1 x_2 x_3

hence generalized all the untrained input vectors accordingly. Note that, although the problem
is  a  linearly  separable  one,  the  parity  relation  is  discovered  by  the  neural  unit  because  it  requires 
just  one  monomial  whereas  the  linearly  separable  solution  requires  three  monomials.  Similarly,  the 
neural  unit  learned  to  generalize  to  the  4D  parity  relation  after  training  it  with  the  4D  pattern  set 
shown  in  Table  2. 
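
As a check on the 3D result, the short sketch below evaluates the learned product term on all eight input vectors and expands it into monomial weights. The factor coefficients are those printed above; the helper names phi and expand are hypothetical and not part of the paper.

```python
from itertools import product

# Factor coefficients (a_i, b_i) of the single learned product term, as printed in the text.
learned = [[(-0.01, 0.99), (0.03, -0.97), (0.06, -0.99)]]

def phi(terms, x):
    """Sum over product terms of prod_i (a_i + b_i * x_i)."""
    total = 0.0
    for term in terms:
        p = 1.0
        for (a, b), xi in zip(term, x):
            p *= a + b * xi
        total += p
    return total

def expand(terms):
    """Expand the factored form into monomial weights.

    Keys are tuples of input indices (0-based); () is the constant term.
    """
    weights = {}
    for term in terms:
        partial = {(): 1.0}
        for i, (a, b) in enumerate(term):
            nxt = {}
            for mono, c in partial.items():
                nxt[mono] = nxt.get(mono, 0.0) + c * a                  # take the constant a_i
                nxt[mono + (i,)] = nxt.get(mono + (i,), 0.0) + c * b    # take b_i * x_i
            partial = nxt
        for mono, c in partial.items():
            weights[mono] = weights.get(mono, 0.0) + c
    return weights

weights = expand(learned)

# The sign of phi agrees with the parity relation x_1 * x_2 * x_3 on every input,
# including the two vectors (1, 1, 1) and (-1, -1, -1) that are absent from Table 1.
sign_matches_parity = all(
    (phi(learned, x) > 0) == (x[0] * x[1] * x[2] > 0)
    for x in product((-1, 1), repeat=3)
)
```

The weight of x_1 x_2 x_3 comes out as 0.99 * 0.97 * 0.99, about 0.95, and every other monomial weight is below 0.06 in absolute value, which is consistent with the approximation given above.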

Example 2: The problem considered here is the discovery of the Boolean function f(X) =
(x_1 ∨ x_2) ∧ (x_3 ∨ x_5), where ∨ and ∧ represent disjunction (binary OR) and conjunction (binary AND)
respectively, x_4 being a redundant input. The neural unit is trained with 25 of the 32 patterns and
formed the polynomial

    φ(X) = (0.79 - 0.01 x_1)(-0.90 - 0.33 x_2)(-0.89 - 0.31 x_3)(0.92 + 0.01 x_4)(0.83 + 0.28 x_5)
         + (0.32 - 0.49 x_1)(0.35 - 0.48 x_2)(-0.51 - 0.84 x_3)(-0.63 + 0.18 x_4)(-0.31 + 1.00 x_5)
         + (-0.78 - 0.10 x_1)(0.95 + 0.14 x_2)(0.49 - 0.84 x_3)(0.88 - 0.04 x_4)(0.49 - 0.86 x_5)
         + (-0.51 + 0.94 x_1)(-0.58 + 0.57 x_2)(-1.00 - 0.23 x_3)(-0.86 - 0.09 x_4)(-0.89 - 0.43 x_5)

         ≈ 0.08 + 0.41 x_1 + 0.39 x_2 + 0.37 x_3 + 0.36 x_5
           - 0.44 x_1 x_2 + 0.16 x_1 x_3 + 0.18 x_2 x_3 + 0.18 x_1 x_5 + 0.16 x_2 x_5 - 0.38 x_3 x_5
           - 0.13 x_1 x_2 x_3 - 0.12 x_1 x_2 x_5 - 0.10 x_1 x_3 x_5 - 0.11 x_2 x_3 x_5 + 0.07 x_1 x_2 x_3 x_5

where in the approximation monomials with weights less than 0.05 in absolute value are omitted,
which does not affect S(X). Hence, S(X) is independent of x_4, as it should be. The neural unit
generalized the remaining 7 patterns consistently with the Boolean function f(X).

Example 3: The neural unit is trained with the 10 patterns shown in Table 3, which are generated
using the Boolean function f(X) = (x_1 ⊕ x_2) ∧ (x_3 ⊕ x_4), where ⊕ represents the XOR operation.

    x_1   x_2   x_3   x_4        d_p
    -1    -1     1     1    ->   -1
     1    -1     1    -1    ->    1
     1    -1     1     1    ->   -1
    -1    -1    -1    -1    ->   -1
     1     1     1     1    ->   -1
    -1     1     1    -1    ->    1
     1     1    -1    -1    ->   -1
     1     1     1    -1    ->   -1
     1     1    -1     1    ->   -1
    -1     1    -1     1    ->    1

Table 3: Training pattern set in 4D

The neural unit comes up with the relation

    φ(X) = (-0.95 + 0.06 x_1)(-0.94 + 0.82 x_2)(0.07 + 0.75 x_3)(0.04 - 0.77 x_4)
         + (-0.05 - 0.93 x_1)(-0.64 - 0.62 x_2)(-0.93 + 0.00 x_3)(0.89 - 0.01 x_4)

         ≈ -0.49 x_1 - 0.48 x_1 x_2 - 0.52 x_3 x_4 + 0.45 x_2 x_3 x_4
         ≈ 0.5 (-x_1 - x_1 x_2 - x_3 x_4 + x_2 x_3 x_4)
         = -0.5 x_1 (1 + x_2) - 0.5 x_3 x_4 (1 - x_2)

which can be interpreted as:

    φ(X) = -x_3 x_4    if x_2 = -1
           -x_1        if x_2 = +1

and is represented by the Boolean function:

    φ(X) = (¬x_1 ∧ x_2) ∨ ((x_3 ⊕ x_4) ∧ ¬x_2)

which is a simpler relation than the one used in generating the pattern set.

4. CONCLUSION

In this paper we have introduced a binary-input supervised neural unit that, through learning,
forms higher-order synaptic correlations expressed by a polynomial. Consequently, it can learn and
generalize linearly inseparable input vectors. The gradient-descent based learning algorithm is such
that the terms in the polynomial that reflect an existing relation in the input vectors are highlighted
(i.e. assigned higher absolute weight values), and generalization is then done in accordance with
the relation discovered. We do not, however, think that the learning algorithm used is
the most efficient algorithm one can find, and we are currently working on this point.


Abstract

An adaptive tessellation variant of the CMAC architecture is introduced. Adaptive tessellation is
an error-based scheme for distributing input representations. Simulations show that the new network
outperforms the original CMAC at a variety of learning tasks, including learning the inverse kinematics of
a two-link arm.

1  Introduction 

The  cerebellar  model  articulation  controller  (CMAC)  is  a  supervised  learning  algorithm  inspired  by  the 
architecture  of  the  cerebellum  [1, 2].  It  has  been  successfully  applied  to  a  number  of  tasks  which  require  a 
quick, computationally efficient algorithm. For example, Miller and others have used CMAC for various
robotic  control  tasks  [9, 10, 11]. 

CMAC is essentially a continuous-valued perceptron: a general model of neural learning and performance
[12]. A perceptron consists of three layers of neurons: sensory (S), association (A) and response (R). Nodes in
each layer compute a weighted sum of their inputs. Sensory nodes (S-nodes) transduce input signals, which
activate association nodes (A-nodes) through a fixed mapping. A-nodes activate response nodes (R-nodes) through
modifiable weights to generate output. Error signals drive weight modification between the A and R layers
using the least-mean-squares law or delta rule [16].

Since the perceptron calculates a linear transform of input to output, there are many mappings that
cannot be represented by a given perceptron, in particular those that are not linearly separable. However,
it is possible to increase the utility of a perceptron by using specific S-A mappings given a priori knowledge
of the mapping.

CMAC is an instance of a perceptron that implements a specific S-A mapping called expansion-recoding.
In expansion-recoding each input, represented as a pattern of activity over the S-nodes, activates a fixed
subset of the A-nodes. These subsets are chosen so that an equal number of A-nodes are activated for every
input (expansion) and so that nearby inputs activate overlapping subsets (generalisation). The number of
A-nodes activated by any particular input determines the granularity, or resolution, of the mapping, and
the overlap between adjacent inputs determines how smoothly this mapping changes and, therefore, the
amount of generalisation. The choice of the S-A mapping instantiates a hypothesis: that the complexity or
non-linearity of the target mapping is such that it can be adequately represented by the chosen
expansion-recoding.
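
A minimal one-dimensional sketch of expansion-recoding follows. It assumes the commonly used arrangement of g overlapping tilings, each shifted by one quantisation step, which this paper does not spell out; the function names active_nodes and overlap are hypothetical.

```python
def active_nodes(q, g):
    """Return the g A-nodes activated by the quantised input q (an integer).

    Node (k, j) is cell j of the k-th tiling; tiling k is shifted by k steps (assumption).
    """
    return {(k, (q + k) // g) for k in range(g)}

def overlap(q1, q2, g):
    """Number of A-nodes shared by two quantised inputs."""
    return len(active_nodes(q1, g) & active_nodes(q2, g))
```

With g = 4, every input activates exactly four A-nodes, adjacent inputs such as 5 and 6 share three of them, and inputs four or more steps apart share none, so g sets both the resolution and the width of generalisation.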

This paper will refer to the choice of a particular S-A mapping as the choice of a tessellation scheme for the
input  space.  Figure  1  shows  schematically  possible  tessellation  schemes  for  a  two-dimensional  input  space. 


2  Adaptive  Tessellation 

The choice of an S-A mapping is critical for the performance of a perceptron. One might therefore ask
whether  such  a  mapping  must  be  fixed  or  whether  it  can  be  altered  during  the  course  of  learning.  If  the 
mapping  can  be  selected  automatically  during  training,  the  designer  need  not  choose  the  input  quantisation 
and  generalisation. 

†Supported in part by ARPA ONR N00014-92-J-4015, NSF IRI-90-00530, ONR N00014-91-J-4100, and a Boston University
Presidential Graduate Fellowship.

Figure 1: Possible input space tessellations. Each circle represents a portion of the input space which activates
an association node. The scheme on the left is appropriate for a mapping which changes rapidly and in which
generalisation is not required, the middle scheme is appropriate for a more uniform mapping in which outputs can
be generalised between adjacent areas of the input space, and the adaptive scheme on the right demonstrates a
non-uniform distribution appropriate for a non-uniform mapping.

To  approximate  such  a  function  to  a  required  level  of  accuracy,  CMAC  must  allocate  enough  nodes  to 
quantise  the  entire  input  space  at  the  degree  of  granularity  required  by  the  most  complex  region.  An  adaptive 
tessellation  algorithm  is  one  in  which  the  number  of  nodes  and  their  position  in  input  space  varies  during 
training. Such an algorithm can approximate a function more accurately with fewer nodes by allocating those
nodes  in  accordance  with  the  structure  of  the  function. 

There  are  many  ways  to  perform  an  adaptive  tessellation  and  the  scheme  described  here  was  chosen  not 
for  optimality  but  to  show  that  a  simple  heuristic  can  be  of  use.  In  this  scheme  A-nodes  are  allocated  during 
training  to  regions  in  which  the  error  is  high.  The  network  is  initialised  with  a  single  node  which  represents 
the  entire  input  space.  After  some  period  of  training  this  node  is  split  into  a  number  of  sub-nodes.  The 
sub-nodes  are  further  broken  down  at  regular  intervals  during  the  training.  Thus,  the  network  operates  on 
two  time  scales:  a  short  time  scale  corresponding  to  the  regular  CMAC  training  schedule  and  delta  rule, 
and  a  long  time  scale  corresponding  to  the  allocation  of  new  input  representations. 


Figure 2: Functional approximations produced by two networks, CMAC (left) and an adaptive tessellation variant
(right). The function to be approximated is a sinusoid in part of its domain and constant in other parts. Both networks
utilise the same number of A-nodes (100). The locations of the A-nodes (in the input space or functional domain) are marked;
note that in the adaptive network they are sparse in the uniform regions and more concentrated in the sinusoidal regions.
Also, within each sinusoidal region, the distribution is biased by the gradient of the curve.

The  node  to  be  split  is  chosen  by  considering  the  accumulated  error  statistics  over  all  nodes  and  locating 
the  source  of  the  highest  error.  The  result  of  this  process  is  a  node  distribution  that  is  concentrated  in  regions 
where  the  error  was  highest  during  training.  Figure  2  shows  an  example  of  such  a  distribution  for  a  function 
which  is  uniform  over  part,  but  not  all,  of  its  domain. 


3  Simulation: Inverse Kinematics


Learning  the  inverse  kinematics  of  a  two-link  arm  is  a  simple  problem  that  illustrates  some  of  the  advantages 
and disadvantages of the new architecture in comparison to the original CMAC. Assume that an arm is
moved  randomly  by  assigning  its  two  joint  angles;  the  resultant  position  of  its  end-effector  is  used  as  input 
and  the  joint  angles  as  target  values  for  learning.  Such  a  scheme  has  been  postulated  to  form  a  part  of  motor 
learning  [5, 8].  The  problem  itself  is  simple;  of  interest  is  the  accuracy  that  can  be  achieved  with  a  limited 
representation  and  without  extensive  domain  knowledge. 
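
The data-generation scheme just described can be sketched as follows. The unit link lengths and the joint-angle range of zero to π follow the description in the caption of Figure 4; treating the second angle as relative to the first link, and the function names, are assumptions made only for this illustration.

```python
import math
import random

def end_effector(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics of a planar two-link arm.

    theta2 is measured relative to the first link (assumption).
    """
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

def make_training_set(n, seed=0):
    """Move the arm randomly; the position is the input, the joint angles are the target."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        theta = (rng.uniform(0.0, math.pi), rng.uniform(0.0, math.pi))
        data.append((end_effector(*theta), theta))
    return data
```

Every generated position lies within two units of the origin, and a uniform choice of joint angles produces a non-uniform distribution of end-effector positions.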

In order to compare performance, parameters were selected that produced reasonable performance from
both  networks,  although  these  parameters  were  not  optimal  for  either.  All  parameters  were  kept  constant 
and  the  same  learning  rates,  number  of  nodes,  number  of  trials  and  data  were  used  for  both  networks.  Three 
levels  of  generalisation  were  used  for  CMAC;  no  generalisation  was  performed  in  the  adaptive  tessellation 
network.  Figure  3  shows  the  time  course  of  network  errors  during  training  of  both  networks.  It  can  be  seen 
that  the  adaptive  network  reaches  a  lower  average  error  than  the  conventional  CMAC  network  at  any  level 
of  generalisation. 

The  learning  task  in  this  simulation  was  kept  as  simple  as  possible  by  restricting  the  movement  of  the 
arm  in  such  a  way  that  only  a  portion  of  the  input  space  was  spanned.  This  subspace  was  chosen  so  as 
to  avoid  singularities  associated  with  a  complete  revolution  of  the  arm  and  within  it  only  a  single  arm 
configuration  was  associated  with  each  end-effector  position. 


Figure  3:  Average  (left)  and  maximum  (right)  mean  squared  error  from  networks  trained  on  the  two-link  arm  inverse 
kinematics  problem.  Each  graph  compares  performance  of  the  adaptive  tessellation  network  with  CMAC  networks 
utilising  three  different  levels  of  generalisation.  Each  generalisation  level  specifies  how  many  A-nodes  are  activated  by 
each  input. 


However,  the  mapping  to  be  learned  still  involves  some  complexities.  One  is  a  singularity  at  the  origin 
of  the  arm,  where  multiple  positions  of  the  first  joint  correspond  to  a  single  location.  Another  is  the  increase 
in  the  effect  of  a  joint  angle  change  at  extreme  distances  from  the  origin.  Both  of  these  factors  influence 
the  distribution  of  input  representations  in  the  adaptive  network,  as  can  be  seen  in  Figure  4.  Note  that  this 
distribution  does  not  simply  reflect  the  input  distribution  —  it  also  takes  into  account  the  distribution  of 
errors. 


4  Discussion 

Adaptive tessellation has been identified as an error-based mechanism for varying the mapping between
sensory and association units in a perceptron. Other models also perform such variations, for example,
back-propagation [13]. If adaptive tessellation lies on a spectrum of flexibility in input mappings, at one
end of which are simple perceptrons using fixed linear mappings and at the other end of which are models
such as back-propagation, which allow continuous modification of all hidden unit (A-node) weights, then
it should be located somewhere in the middle of this range.

Figure 4: Distribution of input vectors (left) and A-nodes (right) from the adaptive tessellation network trained on the
inverse kinematics problem. Each input vector represents a point in the two-dimensional space around the arm centre,
assumed to be located at the origin. Since the arm links are each one unit long, the arm can reach any point lying
within a two-unit-radius circle of the origin, but its motion is restricted by generating joint angles in the range zero to π.
The input distribution shows the result of a uniform random selection of joint angles from this range. The distribution of
A-nodes is influenced by the input distribution but is more heavily concentrated around the origin and the edges of the
workspace, as these are the regions that generate high errors. The same number of points is shown in each plot.


Like back-propagation, adaptive tessellation CMAC distributes its A-nodes (hidden layer units) in
accordance  with  the  error  statistics.  However,  in  order  to  do  so,  it  utilises  a  heuristic  rather  than  the 
generalised  delta  rule.  Similar  heuristics  have  been  found  useful  in  conjunction  with  back-propagation,  for 
example,  the  cascade  correlation  algorithm  [4]  is  a  back-propagation  variant  which  allocates  new  (hidden) 
units  while  the  error  remains  above  a  given  threshold. 

The  use  of  a  heuristic  implies  applicability  to  particular  classes  of  problems.  Adaptive  tessellation  is 
particularly  suited  to  the  learning  and  representation  of  functions  which  contain  regions  of  uniformity  and 
regions  of  high  variation.  It  is  less  well  suited  to  functions  which  vary  continuously. 

Self-organising  feature  maps  (SOFM)  (4  15]  employ  a  heuristic  which  performs  an  adaptive  tessellation 
using  cooperative  interactions  between  nodes.  However,  this  tessellation  is  based  on  the  relative  densities 
of  inputs,  rather  than  an  error  measure,  resulting  in  greater  concentration  of  nodes  solely  around  areas 
in  which  inputs  are  frequently  sampled.  A  variant  of  the  SOFM  algorithm  uses  error-based  distribution 
of  nodes  and,  like  adaptive-tessellation  CMAC,  breaks  down  existing  nodes  when  errors  are  high  within 
the  regions  represented  by  those  nodes  [14].  This  model  uses  a  mixture  of  techniques:  it  follows  the 
conventional SOFM algorithm to distribute nodes according to the input space densities, then replaces each
node which performs at a level worse than a fixed threshold by a new set of nodes. The system is used for
classification  and  performance  is  measured  by  analysing  the  number  of  incorrect  classifications  made  by  a 
given  node. 

Another  system  which  utilises  a  heuristic  similar  to  that  employed  by  adaptive  tessellation  is  ARTMAP 
[3]. ARTMAP allocates new categories specifically when no satisfactory category exists, i.e., when the
existing  classification  scheme  would  incorrectly  classify  an  input.  The  new  category  is  placed  at  the  location 
of  the  input  vector  which  caused  the  error. 

So adaptive tessellation can be viewed as a particular heuristic technique for choosing an input
representation. It is more powerful and more complex than the fixed representation used in conventional CMAC
but simpler and faster than techniques such as back-propagation. Although it sacrifices the easy array
implementation that CMAC employs, it can be implemented in a computationally efficient manner (see
Appendix).

Performance of the adaptive tessellation network could be improved by implementing some form of
generalisation. Linear generalisation, implemented by averaging in CMAC, would allow the network to
approximate functions with linear regions. More complex curve fitting could be performed by using the
nodes to represent points on a spline or other curve types. As always, the optimal generalisation strategy
will depend on the shape of the function; the best performance will always be achieved by selecting a
generalisation strategy that is based on a priori knowledge of the function. The variation in size of regions
represented by A-nodes in the adaptive tessellation architecture would automatically restrict generalisation
to small regions where errors are high and widen it where errors are low.


Appendix:  The  Adaptive  Tessellation  Algorithm 

Input and Output: Input is a series of M-dimensional real-valued vectors, i. Output is an N-dimensional
vector, o. Input and output vectors used for training are assumed to be drawn from a function to be
approximated, f : R^M → R^N.

Data  Structure:  Although  the  network  architecture  could  be  implemented  in  many  different  ways,  all 
simulations described in this paper were implemented using a tree data structure. The tree is composed
of  two  types  of  nodes:  output  nodes  and  split  nodes.  Each  output  node  stores  an  output  vector  x  and  an 
accumulated  error  measure  e.  Each  split  node  stores  an  input  space  weight  vector  w. 

Every split node has 2^M branches. Each branch represents a rectangular region of the input space, one
side of which is defined by the weight vector of the parent and the other sides of which are implicitly defined
by neighbouring regions. Thus, a split node divides the input space into 2^M hypercubes around the point
represented by its weight vector. The split nodes define the distribution of input representations.

Output  nodes  are  always  found  at  the  leaves  of  the  tree,  i.e.,  have  no  children.  They  correspond  to  the 
A-nodes  of  the  network;  their  output  vectors  are  the  weight  vectors  connecting  the  A  layer  to  the  R  layer. 

Initialisation:  The  tree  is  initialised  by  constructing  a  single  output  node  at  the  root  which  is  initially 
responsible  for  the  entire  input  space.  Its  output  vector  is  initialised  randomly. 

Output:  Each  input  is  classified  by  traversing  the  tree,  comparing  the  input  vector  to  the  weight  vectors 
of  the  split  nodes.  The  appropriate  branch  of  the  tree  is  followed  and  further  comparisons  made  if  another 
split  node  is  encountered.  This  process  is  continued  until  an  output  node  (leaf)  is  reached,  whereupon  the 
associated  output  vector  x  becomes  the  network  output. 

Training: The network is trained by presenting it with a series of input/output pairs, i and o. Each of these
pairs is assumed to represent an instance of the mapping, i.e., f(i) = o. The network output in response to
i is calculated as described above; call this x. The error measure, o - x, is used to update the output vector
of the unit which responded to the input using the delta rule:

    x' = x + η (o - x)

where η is a learning rate. The summed square error, Σ_n (o_n - x_n)^2, is added to the accumulated error
associated with the output node, e.

After a certain time period, the tree is reorganised and new split and output nodes added. Each such
reorganisation allocates one new split node and 2^M new output nodes at the place in the input space where
error was maximal, i.e., the output node with the greatest e. The new split node replaces this output node
and has 2^M new output node children, each one of which inherits the output vector of the replaced output
node. All error statistics are zeroed after this splitting process.
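
The data structure and the two time scales described in this appendix can be sketched as below. This is an illustration, not the implementation used for the simulations: the appendix does not say how the weight vector of a new split node is chosen, so the sketch assumes it is the running mean of the inputs that reached the replaced output node, and all class and method names are hypothetical.

```python
import random

class Leaf:
    """Output node (A-node): output vector x and accumulated error e."""
    def __init__(self, x):
        self.x = list(x)
        self.e = 0.0
        self.count = 0      # assumption: inputs seen, used only to place the split point
        self.mean = None    # assumption: running mean of those inputs

class Split:
    """Split node: a point w in input space and 2**M children, one per region around w."""
    def __init__(self, w, children):
        self.w = w
        self.children = children

class TessellationTree:
    def __init__(self, n_out, eta=0.5, seed=0):
        rng = random.Random(seed)
        self.eta = eta
        self.root = Leaf([rng.uniform(-1.0, 1.0) for _ in range(n_out)])

    def _descend(self, i):
        parent, slot, node = None, None, self.root
        while isinstance(node, Split):
            # Bit k of the branch index is set when i[k] >= w[k].
            parent = node
            slot = sum(1 << k for k, (ik, wk) in enumerate(zip(i, node.w)) if ik >= wk)
            node = node.children[slot]
        return parent, slot, node

    def output(self, i):
        return self._descend(i)[2].x

    def train(self, i, o):
        leaf = self._descend(i)[2]
        err = [ok - xk for ok, xk in zip(o, leaf.x)]
        leaf.e += sum(d * d for d in err)                            # summed square error
        leaf.x = [xk + self.eta * d for xk, d in zip(leaf.x, err)]   # delta rule
        leaf.count += 1
        if leaf.mean is None:
            leaf.mean = list(i)
        else:
            leaf.mean = [m + (ik - m) / leaf.count for m, ik in zip(leaf.mean, i)]

    def leaves(self):
        found, stack = [], [(None, None, self.root)]
        while stack:
            parent, slot, node = stack.pop()
            if isinstance(node, Split):
                stack.extend((node, s, c) for s, c in enumerate(node.children))
            else:
                found.append((parent, slot, node))
        return found

    def reorganise(self):
        all_leaves = self.leaves()
        parent, slot, worst = max(all_leaves, key=lambda t: t[2].e)
        if worst.count > 0:
            m = len(worst.mean)
            # Each new output node inherits the output vector of the replaced node.
            split = Split(worst.mean, [Leaf(worst.x) for _ in range(2 ** m)])
            if parent is None:
                self.root = split
            else:
                parent.children[slot] = split
        for _, _, leaf in all_leaves:
            leaf.e = 0.0    # all error statistics are zeroed
```

Calling train for every sample and reorganise at regular intervals reproduces the short and long time scales; after k successful reorganisations the tree holds 1 + k(2^M - 1) output nodes.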
Abstract:

A fast learning algorithm that adaptively decides the architecture and synaptic weights
of a neural network for any training set is presented here. It aims to use the fewest hidden nodes
in only one hidden layer and to map the input-output patterns within any required precision.
For any N-pattern training set, a maximum of N-r bias nodes are enough to learn all the
patterns within the required precision (r is the rank, usually the number of dimensions, of the input
patterns). In this algorithm, the inverse of the activation function is applied to the output data;
thus the nonlinear part of the output layer is traversed and can be ignored in the succeeding
learning process. Then we try to map the input data linearly to the traversed output data. If
the mapping has greater error than required, we add a hidden bias node and append to each
input pattern the corresponding output of the added node, thereby increasing the rank of the
updated input. The pseudoinverse is applied to achieve the least-squares-error linear mapping. The
process of adding a hidden bias node is repeated until the required mapping precision is achieved.

1.  Introduction: 

A supervised-learning neural network is a feed-forward network. It takes an input
pattern, feeds it forward (usually via some hidden layers), and produces the corresponding output
pattern on the output layer.

We call it supervised learning because we want specific output patterns from it
for given input patterns. These input-output patterns constitute the training set.

We can see that there are only two types of operators in a neural network: matrix operators
W and a nonlinear activation operator. In this paper, we use the following nonlinear
activation operator:

    σ : x → (e^x - e^-x) / (e^x + e^-x)

We expect the architecture and learning algorithm of this type of network to:

1. Learn all the patterns in the training set quickly with small error;

2. Adaptively decide the architecture without using many hidden layers and nodes;

3. Predict the outputs of other patterns;

4. Be easily implemented in electronic circuits, etc.

This neural network model has great potential for applications in various areas. But
up to now, we still do not have an architecture or learning algorithm that satisfies all these
expectations. In particular, slow learning and the difficulty of deciding on a proper architecture
are major drawbacks for those who wish to apply neural networks.

With the method presented here, we can at least satisfy expectations 1 and 2 very well. We
can determine a neural network with a small number of hidden nodes and obtain its weights quickly;
with this network, each input pattern in the training set is mapped to the desired
output almost exactly, with the error within any required precision. The architecture is also well
suited to implementation, because only one type of neuron is used.


2.  Algorithms  Description: 

With  this  learning  algorithm,  the  desired  outputs  are  traced  back  through  the  nonlinear 
part  with  the  inverse  of  the  activation  function. 


    σ^-1 : x → (1/2) ln((1 + x) / (1 - x))


and  then  we  try  to  solve  the  problem  linearly. 

Denote the input patterns by A_{N,n}, where N is the number of patterns and n is the number of
input dimensions, and denote the output patterns by T_{N,m}, where m is the number of output
dimensions. We apply σ^-1 to each element of T and get B_{N,m}. The range of σ(x) is (-1, 1), so
to apply σ^-1 we should chop those output values of 1 and -1 a little (within the error tolerance).
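
The need to chop the targets can be seen in a short sketch: the inverse activation is unbounded at ±1, so the targets are first clipped into [-1 + eps, 1 - eps]. The function names and the default tolerance are hypothetical.

```python
import math

def sigma(x):
    """Activation (e^x - e^-x) / (e^x + e^-x), i.e. the hyperbolic tangent."""
    return math.tanh(x)

def sigma_inv(t, eps=1e-3):
    """Inverse activation of a target value, after chopping it into [-1 + eps, 1 - eps]."""
    t = max(-1.0 + eps, min(1.0 - eps, t))
    return 0.5 * math.log((1.0 + t) / (1.0 - t))
```

A target of exactly 1 is therefore mapped as if it were 1 - eps, which keeps every entry of B finite at the cost of an output error of at most eps.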

We hope to map A to B with a weight matrix W, i.e. to find W such that A·W = B. To have an
exact solution of the linear equations, we require that each column of B, b^j (j = 1..m), should
be a linear combination of the columns of A, a^i (i = 1..n), with scalars W_ij. That is, we require
that A should have the same rank as [A|B].

For any A, B with N patterns, we have rank(A) <= rank([A|B]) <= N. When
rank(A) = rank([A|B]), we can find an exact solution W for A·W = B. If rank(A) <
rank([A|B]), we first add a constant bias node "1" and append it to each input pattern.
This will most probably increase the rank of the updated input patterns by one. If still
rank(A) < rank([A|B]), then we can expand rank(A) by appending linearly independent
columns to A. These columns must be derived from the input patterns in A. We add
hidden nodes and obtain these columns as their output vectors. To use the fewest nodes, we
add them one by one, each one choosing its weights such that its output vector is
linearly independent of the columns of A, and the space spanned by this column and the columns
of A covers as much as possible of the space spanned by the columns of B. To get weights for such
a hidden neuron, one approach is to apply the method used by Fahlman and Lebiere [3], attempting
to maximize the covariance between the new unit's output and the residual error we are
trying to eliminate.

In this first algorithm, the gradient descent method is applied to approach maximum
covariance, which still consumes some time. For applications where learning speed matters
more, a simplified method can be applied: instead of training the hidden neuron's
weights, simply choose the weights randomly. When rank(A) < rank([A|B]) <= N, the space
spanned by the columns of A is only a very small part of the N-dimensional space. The weights
are random, and the nonlinear activation function distorts the sum of the weighted inputs,
so each new column of hidden-neuron outputs is almost certain to be
linearly independent of the columns of A.

From the constraint rank(A) <= rank([A|B]) <= N, we can see that at most N - r linearly
independent vectors are needed for exact learning of the training set. This is the upper bound
on the number of hidden nodes needed.

In either case, the bias nodes are hidden nodes with fixed weights, and they are
arranged in the same hidden layer. The output vector of each hidden node is appended to the
corresponding input patterns, thereby increasing the rank of the updated inputs. The pseudoinverse
is then applied to train the weights for the output layer. The pseudoinverse is a direct method
that achieves the least-squares-error solution in a few matrix operations; it is much faster than
the gradient descent method, which tries step by step to optimize the weights for the
least-squares-error solution. This is the major reason that training is sped up so much.

3.  Algorithms: 

Algorithm  1.  (Bias  Nodes  with  Maximum  Covariance) 

1. Arrange the data of the input patterns in matrix form: A. Arrange the data of the output
patterns in matrix form: T. Each pattern is a row.

2. Add a constant bias node "1". Update A by appending a column of all "1" as its last
column.

3. Decide the precision of learning: e. Clip the elements of T to lie within [-1 + e, 1 - e]: if
any element is larger than 1 - e, let it be 1 - e; if any element is less than e - 1, let it be e - 1.

4. Set W_bias to be empty.

5. B = f^{-1}(T), where f is the activation function of the output layer;

6. A+ = pseudoinverse of A;

7. W = A+ · B;

8. E = |B - A·W|^2 / (N * m);

9. If (E <= e) stop. Else go to step 10.

10.  Add  a  hidden  node  with  random  weights  connecting  every  input  dimension  and  the 
constant  bias  node. 

11.  Update  the  weights  of  the  hidden  node  with  gradient  descent  method  to  approach 
maximum  covariance  (Refer  to  [3]). 

12. Append the weights of this hidden node to W_bias as its last column.

13. Form a vector of the outputs of the hidden node corresponding to each input pattern.
Update A by appending this vector as its last column.

14. Go to step 6.

Algorithm 2. (Bias Nodes with Random Weights)

Skip step 11 of Algorithm 1.
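
The steps of Algorithm 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the activation is assumed to be tanh (so its inverse is arctanh), the random weights are drawn uniformly from [-1, 1], the stopping tolerance `tol` is kept separate from the clipping precision `eps`, and the names `train_bias_network` and `predict` are hypothetical.

```python
import numpy as np

def train_bias_network(X, T, eps=1e-3, tol=1e-6, max_nodes=None, seed=0):
    """Sketch of Algorithm 2: append random-weight hidden ("bias") nodes
    until the linear system A W = B is solved to within tol."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    if max_nodes is None:
        max_nodes = N                       # at most N - r nodes are needed
    # Steps 1-2: input patterns as rows, plus a constant bias column.
    X1 = np.hstack([X, np.ones((N, 1))])
    A = X1.copy()
    # Step 3: clip targets so the inverse activation stays finite.
    Tc = np.clip(T, -1.0 + eps, 1.0 - eps)
    # Step 5: pass the targets back through the inverse activation (assumed tanh).
    B = np.arctanh(Tc)
    # Step 4: no hidden-node weights yet.
    W_bias = np.empty((X1.shape[1], 0))
    while True:
        # Steps 6-8: pseudoinverse solution and mean squared residual.
        W = np.linalg.pinv(A) @ B
        E = np.mean((B - A @ W) ** 2)
        # Step 9: stop when the residual is small enough.
        if E <= tol or W_bias.shape[1] >= max_nodes:
            return W_bias, W, E
        # Step 10: new hidden node with random weights (step 11 is skipped).
        w = rng.uniform(-1.0, 1.0, size=(X1.shape[1], 1))
        # Steps 12-13: store the weights and append the node's output column.
        W_bias = np.hstack([W_bias, w])
        A = np.hstack([A, np.tanh(X1 @ w)])

def predict(X, W_bias, W):
    """Forward pass: inputs and bias, hidden outputs, then the output layer."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    A = np.hstack([X1, np.tanh(X1 @ W_bias)])
    return np.tanh(A @ W)
```

For example, calling `train_bias_network` on the four XOR patterns with targets in {-1, 1} needs at least one hidden node, because the XOR target does not lie in the span of the two inputs and the constant column.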

4.  Related  works: 

The idea of this Bias Architecture and Rank-Expanding Algorithm was inspired by
the work of Friedrich Biegler-Konig and Frank Barmann [2] and the work of Scott
E. Fahlman and Christian Lebiere [3].

A Bias Architecture is a Cascade Architecture in the sense that the weights of the bias
nodes are fixed and we then combine the input patterns with the bias outputs to train the
output layer.

However, the Cascade-Correlation Algorithm adds hidden nodes one by one, each in a different
layer, i.e. each new node is a deeper layer, whereas we put the bias nodes in the same hidden
layer to reduce the response time. We also apply the pseudoinverse to train the output layer
directly, instead of applying gradient descent as in [3], so training can be much faster.

In the second method, which adds bias nodes with random weights, another difference is
that the Cascade-Correlation Algorithm adds hidden nodes by maximizing the covariance,
whereas here we do not train the weights of the hidden nodes; we just set them randomly, only
to expand the rank of the input and to have more weights involved in the output, so as to
achieve an exact solution in the linear part. This helps to reduce training time while still
mapping the input to the output exactly, but it might require some more bias nodes.

The method of separating the linear and nonlinear parts on each layer of nodes in the
network was presented in [2]. However, the authors did not discuss the network architecture;
they adopted a conventional multilayer network. When the nonlinear part on one layer is
separated, there are still nonlinear parts involved on the other layers. In their Least Squares
Backpropagation (LSB) Algorithm, linear least-squares computation (pseudoinverse) is applied
back and forth through the layers repeatedly, trying to diminish the error. But usually, after
one or two iterations the error stops decreasing, and they suggested using backprop or another
learning algorithm to further reduce the error, which again would be time consuming.

The authors stated a special case under which they could achieve an exact solution using
LSB: "Especially if the network includes a hidden layer with more nodes than the number of
examples to be learned and if the number of nodes in succeeding layers decreases monotonically,
the presented algorithm in general finds an exact solution." But they did not explain the reason.
In fact, when adopting a conventional multilayered neural network, if the last hidden layer has
no fewer hidden nodes than N, the number of examples to be learned, then we can achieve an
exact solution in general. Similarly, the weights of the hidden layers are given randomly, so the
output columns of the nodes (more than N) in the last hidden layer are evenly distributed
in the N-dimensional space, and the rank of these columns together would most probably be
N; thus we can achieve an exact solution in the output layer.

5.  Conclusions: 

With this Bias Architecture, we separate the hidden layer and the output layer in training,
so we can traverse the nonlinear part of the output layer with the inverse of the activation
function; then no other nonlinear problem remains, and we transform the neural-network
problem, which combines both linear and nonlinear parts, into the problem of solving linear
equations, which is a more familiar topic and has a good method, the pseudoinverse, for
us to apply.

The first algorithm, which trains the hidden neurons to have maximum covariance, is an
approach to obtaining an architecture with the minimum number of hidden nodes. The second
algorithm aims to further speed up training by adopting hidden nodes with random weights,
which can also serve to expand the rank of the combined inputs. But it might require some
more nodes than the first algorithm does.

We still expect more numerical analysis. There may be faster, better ways to find
the best weights for the bias nodes to achieve the least number of nodes, each time appending
the vectors that best help to expand the space to include the output vectors.
ABSTRACT

One of the widely used neural network models is the back propagation (BP) artificial neural system (ANS).
It is a multilayer, heterogeneous, supervised, feed-forward ANS paradigm. Slow convergence of the BP
learning algorithm hampers its use for problems with a complex and/or large feature space. We have
developed a simple and scalable acceleration technique which preserves the convergence characteristics of
the BP ANS paradigm. The convergence or divergence of the system is detected by the dominant eigenvalue
for each layer. We have discovered a relationship between the speed of convergence and the dominant
eigenvalues. As the eigenvalue deviates from 1, the temperature of the network is adjusted to overcome
the local minima. Numerical experiments indicate a reduction in the learning time for large complex
problems.


Introduction 

The back propagation (BP) artificial neural network (ANN) algorithm is widely used because of its
simplicity and applicability to various problems. Pattern recognition, combinatorics and controls are some
of the major areas of its application. However, slow off-line learning hinders its application to many
problems with a large and complex feature space. The training of the BP paradigm involves a fixed learning
schedule. During training the system searches for a minimum of the error surface in the weight space. The
error surface is usually degenerate, with numerous flat spots, valleys and other unevenness. To aid the
convergence of the system, various parameters of the learning schedule should be varied in a controlled
fashion for faster convergence. The purpose of this research is to develop an acceleration technique for
the BP paradigm. One of our main goals was to achieve acceleration without altering the BP algorithm.

Theory  of  Back  Propagation  Paradigm 

The BP ANN algorithm traces back to 1956, when Rosenblatt (1962) introduced the first connection model,
called the perceptron, which uses the delta learning rule. The delta learning rule calculates the error at
the output processing units using simple Euclidean distance in 1-dimensional space. The actual output is
calculated using the forward propagation rule given by

$$O_{pj} = f\Big(\sum_i w_{ji} I_{pi}\Big)$$

where

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{otherwise} \end{cases}$$

The delta rule is not restricted to binary values. The BP paradigm can also be extended to continuous
signals in discrete time.


Generalized  Delta  Rule 

The generalized delta rule was proposed to solve the problem of learning in a feed-forward network with
a nonlinear activation function. It is a powerful learning algorithm. It carries out an approximation of a
bounded mapping $f: A \subset R^n \to R^m$ using the training pairs $(x_1, y_1), (x_2, y_2), \dots, (x_k, y_k)$
with the mapping $y_k = f(x_k)$, where f is an unknown implicit function which the system evolves through the
adaptation of its internal representation. The error is iteratively propagated back through the hidden
network layers towards the input layer. The weights are adjusted using

$$\Delta_p w_{ji} = \eta\, \delta_{pj}\, O_{pi}$$

where

$$\delta_{pj} = \begin{cases} f'_j(net_{pj}) \sum_k \delta_{pk} w_{kj} & \text{for hidden layer nodes,} \\ (t_{pj} - O_{pj})\, f'_j(net_{pj}) & \text{for output units.} \end{cases}$$

The delta rule is applied iteratively until the network converges.

Minsky and Papert (1969) have pointed out that, for any recurrent architecture, an equivalent feed-forward
ANS exists. The generalized delta rule is therefore applicable to feed-forward as well as equivalent
recurrent network systems.

Acceleration  technique  using  the  Dominant  Eigenvalue  of  the  Iteration  Matrix 

Back Propagation [Rumelhart (1986)] is a powerful supervised learning algorithm for multilayer
feed-forward ANNs. It is an estimating system that stores generalized solutions of arbitrary pattern pairs,
using the gradient descent error correction procedure. It is extremely popular and has been used for a
variety of input/output mapping tasks in pattern recognition and classification problems. The BP learning
procedure is off-line. It has a very slow convergence characteristic. Expanding problem complexity has
forced researchers to discover new algorithms to accelerate convergence.

The theory of linear iterative methods provides a treasure of acceleration algorithms. However, artificial
neural systems are inherently nonlinear in operation. Therefore, any direct use of linear system
acceleration techniques to accelerate the ANS is neither feasible nor practical. Here, we have developed a
new technique to monitor and predict the evolution of the connection strength. We adjust the temperature of
the activation function of the processing elements for faster convergence.

Consider a feed-forward multilayer ANS architecture. The mapping done by the network up to the first
hidden layer is linear. For a piecewise linear model, the mapping from the input layer to the output layer
can be assumed linear. Therefore the set of weights in the first hidden layer evolves linearly. Any
nonlinearity or abrupt change in the error surface is detected by rapid variation of the dominant
eigenvalue calculated from the iteration matrix of the first hidden layer. The iteration matrix is
calculated as follows:

The weight update is given by;


Without loss of generality, we can assume « i . The iterative learning rule can be written as

For the first hidden layer, the iteration matrix G is Ip times an identity matrix. The iteration matrix G
is a square, symmetric and positive definite matrix. For the network to converge, all eigenvalues (λ) of
the iteration matrix must be less than 1.

We may use the residuals (changes in weights) obtained by successive approximations to estimate the
dominant eigenvalue of the iteration matrix. It is given by the following equation:

$$\lambda_1 = \frac{\|W^{(n+1)}\|}{\|W^{(n)}\|}$$

where $W^{(n)}$ denotes the residual at iteration n. The above equation is similar to Rayleigh's quotient
formula.
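
The residual-ratio estimate can be illustrated on a purely linear iteration, where it is known to converge to the dominant eigenvalue. The sketch below is an illustration only: the iteration w <- G w + c, the matrix G and the function name `dominant_eigenvalue_estimates` are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np

def dominant_eigenvalue_estimates(G, c, w0, n_iter=50):
    """Iterate w <- G w + c and return the ratio of successive residual
    norms, ||dw(n+1)|| / ||dw(n)||, for every step after the first."""
    w = np.asarray(w0, dtype=float)
    prev_norm = None
    ratios = []
    for _ in range(n_iter):
        w_next = G @ w + c
        norm = np.linalg.norm(w_next - w)    # size of the change in weights
        if prev_norm is not None and prev_norm > 0.0:
            ratios.append(norm / prev_norm)
        prev_norm = norm
        w = w_next
    return ratios

# Symmetric positive definite iteration matrix with eigenvalues 0.9 and 0.5.
G = np.diag([0.9, 0.5])
ratios = dominant_eigenvalue_estimates(G, np.array([1.0, 1.0]), np.zeros(2))
print(ratios[-1])  # approaches the dominant eigenvalue, 0.9
```

A ratio below one means the weight changes are shrinking; a ratio above one means they are growing, which is the condition the technique uses as a warning sign.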

Consider the feed-forward ANN architecture. The weights between the first hidden layer units and the
input buffer do not contain any nonlinearity. Therefore, they map the input patterns presented at the input
buffer linearly to the first hidden layer during the forward pass. Hence we may apply the theory discussed
earlier to accelerate the convergence of the BP algorithm. The piecewise linear assumption rules out the
use of any linear system for acceleration of the linear systems. However, the dominant eigenvalue of the
iteration matrix can be used as a parameter to monitor the evolution of the network. The contention is also
supported by the fact that any general mapping is fairly continuous and therefore linear within small
intervals.

We have also observed that the dominant eigenvalues of the other layers follow the dominant eigenvalue of
the first hidden layer. Thus, the selected eigenvalue reflects the convergence state of the overall network.

The dominant eigenvalue of the first hidden layer may be used to monitor the evolution of the network.
The cooling of the system is proportional to the dominant eigenvalue of the iteration matrix. Whenever the
system eigenvalues move away from the desired value, the deviation is used to raise the energy of the
system. A dominant eigenvalue greater than one indicates that the system is ascending in the weight space.
At this juncture, the energy of the system is raised, in a step, to some higher level. The system starts
descending in the weight space again using the gradient descent algorithm.

The Acceleration Algorithm

The specific steps of the new algorithm are as follows:

1. The back propagation network topology and the learning parameter values are selected.

2. Random values in the range [-0.5, 0.5] are assigned to the connection weights.

3. For each input pattern, the new activation value for each processing unit in the forward pass is
calculated.

4. The output error, using the least mean squared error criterion for the output buffer processing units, is
calculated and the error is propagated backwards.

5. The connection strength is adjusted using the correlation technique.

6. After one complete presentation of the input set (an epoch), the eigenvalue of the first hidden layer
matrix is calculated.

7. The temperature of the network is adjusted as follows:

a. If the eigenvalue of the iteration matrix is greater than one, then the temperature of the network is
increased inversely proportional to the eigenvalue.

b. If the eigenvalue continues to remain below one, then the temperature is reduced gradually.

8. Steps 3 through 7 are iterated until the convergence criterion is satisfied.
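
Step 7 can be pictured with a small sketch. The paper does not give the exact update formula, so everything numerical here is an assumption made for illustration: the temperature enters the sigmoid as a divisor of its argument, the increase is taken as `gain / eigenvalue`, the gradual reduction is a multiplicative `decay`, and `t_min` is a hypothetical floor. The function names are also hypothetical.

```python
import math

def sigmoid(x, temperature=1.0):
    """Sigmoid activation; a higher temperature gives a flatter curve."""
    return 1.0 / (1.0 + math.exp(-x / temperature))

def adjust_temperature(temperature, eigenvalue, gain=0.5, decay=0.95, t_min=0.1):
    """Step 7 (assumed form): heat up when the dominant eigenvalue exceeds
    one, otherwise cool down gradually toward a floor value."""
    if eigenvalue > 1.0:
        # 7a: the increase is inversely proportional to the eigenvalue.
        return temperature + gain / eigenvalue
    # 7b: gradual reduction while the eigenvalue stays below one.
    return max(t_min, temperature * decay)

print(adjust_temperature(1.0, 2.0))  # 1.25: the network is heated
print(adjust_temperature(1.0, 0.5))  # 0.95: the network is cooled
```

In a training loop this function would be called once per epoch, after the eigenvalue of step 6 has been estimated.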

Results  and  Discussion 

The eigenvalue technique is evaluated on problems widely used for benchmarking. Currently there are
no accepted standards for benchmarking neural networks. However, the XOR problem is mostly used
because of its historical importance. Parity 3 through parity 5 are good cases for checking the scalability
and response of any acceleration technique to increasingly complex problems. The results obtained during
the test runs and the analysis of the results, with the description of each problem, are presented below.
For all the tests we took the average of the eigenvalues over different intervals in order to reduce the
effect of noise.

XOR  Problem 

The Exclusive-OR problem is a linearly inseparable problem. The two-input, one-output training pairs are
shown in Table 1.


Table 1: XOR problem input/output groups

Input 1   Input 2   Output
   0         0         0
   0         1         1
   1         0         1
   1         1         0

Parity n problems are the extension of the parity 2 problem with an increasing number of input attributes.
The feature space grows complex with the number of inputs. The number of patterns grows exponentially
with the number of input attributes. Table 2 shows the input/output pattern pairs for the parity 3 problem.
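
The parity n training sets are easy to enumerate, which also makes the exponential growth in the number of patterns concrete. The helper below is a small illustrative sketch; the name `parity_patterns` is hypothetical, and the output is taken to be 1 when the number of ones in the input is odd, which for n = 2 reproduces the XOR pairs of Table 1.

```python
from itertools import product

def parity_patterns(n):
    """Return all 2**n (input bits, parity bit) training pairs."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n)]

for bits, target in parity_patterns(2):
    print(bits, target)

print(len(parity_patterns(3)), len(parity_patterns(4)), len(parity_patterns(5)))
```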

Table 1:

Problem     BP      EVAT
XOR         459     159
Parity 3    2048    269
Parity 4    4056    1565

From the table we see that there is a significant reduction in convergence time with the new technique
that we have developed.

Conclusions 

A new technique for the acceleration of the back propagation paradigm was developed and evaluated. The
motivation of the work was to develop an acceleration technique without altering the convergence
properties of the standard BP paradigm. Another goal was to simplify the training procedure by reducing the
degrees of freedom. Generally, the choice of learning parameters is critical to convergence time. The
complex interactions of the parameters necessitate thorough knowledge of the problem features. Thus the
eigenvalue acceleration technique reduces the complexity of training the BP paradigm. The goal was to
devise a general technique to monitor the energy state of the neural system. It can help in deciding the
optimal energy state for the system. We have successfully shown that, using the dominant eigenvalue of the
iteration matrix of the first hidden layer, the optimal energy of the system can be decided. This
information is used to accelerate the convergence of the system towards a lower energy level.

The technique developed here for acceleration of the BP paradigm does not alter the algorithm. Also, it
does not use any special feature of the algorithm. Thus, the method may be applicable to any feed-forward
ANS architecture. Further research is being done in this area on increasing the acceleration of the BP
paradigm. Also, we have observed a relation between the speed of convergence of the network and the
eigenvalues. This relation has shown that there is a reduction in the convergence time of the network.
Based on numerical experiments, our future work will focus on a more rigorous derivation of the
mathematical relationships and numerical experimentation with more complex problems.
Forward Propagation (FP-1) Algorithm in
Multilayer Feedforward Network

Abstract

This paper proposes a new forward propagation algorithm (FP-1) for multilayer feedforward
networks. The new approach is a constructive algorithm that transfers errors forward without changing
the established structure, neurons, or trained weights in the network. The concept of the
FP-1 mapping space is defined, in which the procedure of approximation is just to get the inversion of
vectors and determine the FP-1 areas. Several new definitions are introduced, such as FP-1
areas, nonlinear distribution chart, global and local approximation, etc. Using the FP-1 algorithm,
for an arbitrary mapping, we can know the accuracy that every neuron in the network is able
to reach, and understand every step of the approximation. The FP-1 algorithm includes an important
principle: how to decompose the given mapping into global and local components and then solve
the problem by using global and local approximation. The purpose of this paper is to provide the basic
idea of Forward Propagation; specifically, the FP-1 mapping space is emphasized.

1  Introduction 

Although the BP algorithm is used broadly in many applications, the intrinsic mechanism of neurons in the
network, training methods and convergence behavior are imperfectly understood, and a huge amount of
computer time is consumed, especially during training. Despite impressive successes across a wide variety
of applications, this approach prompts many questions that have to be answered.

1. How many hidden units should be used for the given mapping? In this sense, is it possible for us to define
or classify the given mappings into different mapping spaces which can be established in neural networks
with different complexity or difficulty?

2. Why can the nonlinear neurons used in the network approximate the given mapping with arbitrary precision?
What is the real behavior, or the important but still unknown principles, in the network?

3. What concrete types of nonlinear neurons are best used in the network according to the actual application
or data set, and how should they be chosen?

4. Can we know the errors of every step of approximation, or construct the hidden layer neurons definitely to
eliminate the errors to arbitrary desired accuracy?

5. Is the BP algorithm the only approach to implementing feedforward multilayer networks? Are there
other algorithms that implement the mapping more efficiently and easily?

This paper will address these issues and, more importantly, it will explore and present the new idea of the
Forward Propagation (FP) approach, in which the errors are propagated forward so that we can know where
the "bottleneck" of approximation happens. The algorithm proposed in this paper is the first algorithm in
the series of FP approaches. For this reason, it is denoted the FP-1 algorithm. In fact, the FP approach
provides detailed information at every step of forward approximation and unveils many interesting features
of the mapping $f: R_i^n \to R_i^m$. Another important idea proposed in this paper is to decompose global
problems into local ones, and then to fit together the solved local components to satisfy the global
properties needed.
2  Algorithm of Feedforward Propagation

The idea of the FP-1 algorithm is based on the following steps:

1. First, construct a two-layer neural net. The linear and nonlinear optimal algorithms are used to
approximate the given mapping between the input and target data sets.

2. If the two-layer neural net cannot reach the given precision, the third layer is established and the
remaining errors are erased by neurons in the third layer.

3. If the desired accuracy is not satisfied in the third layer, hidden neurons are needed to continue to
reduce the errors until the given precision is reached.

An important feature of FP is that the previously trained weights and nonlinear function $g_j$ are not
changed while the third-layer or hidden neurons are added; only the coefficients of the new neurons are
determined.

2.1  First two layer linear approximation

As mentioned above, for a given mapping $R_i^n \to R_i^m$, $i = 1, \dots, r$, $n < r$, the task is first
decomposed into $R_i^n \to R_i$ problems. The two-layer network is constructed as shown in Fig.1.

In order to introduce the idea of FP-1, the linear optimal approximation will be used, and the least
mean square (LMS) algorithm developed by Widrow and Hoff is briefly described here.

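Assume there are r input vectors $X_i$ with n dimensions corresponding to r outputs $Y_i$ with m
dimensions, $i = 1, \dots, r$, $n < r$. The linear estimate is

$$y^{l}_{ji} = \sum_{k=1}^{n} w_{jk} x_{ik}, \quad i = 1, \dots, r, \quad j = 1, \dots, m \qquad (1)$$

If the desired output is $y_{ji}$ and the linear output in the second layer is $y^{l}_{ji}$, the mean squared
error function is

$$E(W_j) = \frac{1}{2} \sum_{i=1}^{r} (y_{ji} - y^{l}_{ji})^2 = \frac{1}{2} \sum_{i=1}^{r} \Big(y_{ji} - \sum_{k=1}^{n} w_{jk} x_{ik}\Big)^2, \quad j = 1, \dots, m \qquad (2)$$

The gradient at any point on the surface is obtained by differentiating $E(W_j)$ with respect to the
parameter vector $W_j$, where $W_j$ is the weight vector from the input to the jth neuron in the second layer:

$$\frac{\partial E(W_j)}{\partial w_{jk}} = -\sum_{i=1}^{r} (y_{ji} - y^{l}_{ji})\, x_{ik}, \quad j = 1, \dots, m, \quad k = 1, \dots, n \qquad (3)$$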

The  LMS  can,  therefore,  be  written  as 


•  si 


Eq.(4) determines $w_{jk}$ under the LMS criterion. From this point of view, the weights of the two layers
are decided by linear optimal approximation. Let us take Mapping 1, shown below, as an example.

Mapping 1

X =                                     Y =
[ -.55   -.7    -.95   -.78   -.65 ]    [ -.15 ]
[ -.4    -.44   -.38   -.2    -.51 ]    [ -.38 ]
[ -.32   -.4    -.38   -.43   -.19 ]    [ -.75 ]
[ -.21   -.03    .05   -.18    .03 ]    [ -.21 ]
[  .38    .18    .21   -.12   -.13 ]    [  .09 ]
[  .31    .19    .21   -.07    .15 ]    [  .18 ]
[  .39    .23    .31    .41    .25 ]    [  .45 ]
[  .5     .68    .13    .5     .41 ]    [  .53 ]
[  .71    .44    .68    .6     .78 ]    [  .8  ]
[  .48    .5     .82    .73    .19 ]    [ -.52 ]

After iterative calculation, the mean square error is 0.6645. The target, actual output and error values are

target        -.15    -.38    -.75    -.21     .09     .18     .45     .53     .8     -.52
app. result   -.238   -.419   -.095   -.18     .286    .087    .28     .674    .708   -.173
error          .088    .039   -.655   -.029   -.196    .094    .17    -.144    .091   -.347

Tab.1
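
The linear step can be sketched as a batch gradient descent that follows the negative of the gradient in Eq.(3). This is a minimal sketch under stated assumptions rather than the paper's Eq.(4): the step size `mu`, the iteration count, the zero initial weights and the function name `lms_fit` are all choices made for the illustration, and the data in the usage example are synthetic, not Mapping 1.

```python
import numpy as np

def lms_fit(X, Y, mu=0.05, n_iter=20000):
    """Fit y^l_ji = sum_k w_jk x_ik by gradient descent on E(W_j) of Eq.(2).

    X has shape (r, n), Y has shape (r, m); returns W with W[j, k] = w_jk
    and the history of the summed squared error.
    """
    r, n = X.shape
    W = np.zeros((Y.shape[1], n))
    sse = []
    for _ in range(n_iter):
        err = Y - X @ W.T                  # err[i, j] = y_ji - y^l_ji
        sse.append(float(np.sum(err ** 2)))
        W += mu * err.T @ X                # step along -dE/dw_jk from Eq.(3)
    return W, sse

# Synthetic example: r = 10 patterns, n = 5 inputs, m = 1 output.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10, 5))
W_true = rng.uniform(-1.0, 1.0, size=(1, 5))
W, sse = lms_fit(X, X @ W_true.T)
print(sse[0], sse[-1])
```

The step size must stay below 2 divided by the largest eigenvalue of X^T X for the iteration to converge; the default above is chosen with inputs in [-1, 1] and about ten patterns in mind.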


After the linear approximation in the first step, we come to the important issue: how to choose the
nonlinear function $g_j$.
Definition 2.1 Let $M \to N$ ($x_i \in M$, $y_i \in N$) be an arbitrary given mapping $\varphi: R_i^n \to R_i$,
$i = 1, \dots, r$, $n < r$. After linear optimal estimation (LMS), there are actual linear outputs $y^{l}_{i}$
and a set V, $y^{l}_{i} \in V$. If there still exists a mapping $g_j: V \to N$, then $g_j$ is called the
nonlinear function for the given mapping $\varphi$, and the curve of the relationship between $y^{l}_{i}$ and
$y_i$ is denoted the nonlinear distributive chart (NDC).

This definition describes that an arbitrary mapping $\varphi: R_i^n \to R_i$, in the jth neuron of the second
layer, can be decomposed into two steps: a linear mapping and a nonlinear mapping. The relationship between
$y^{l}_{i}$ and $y_i$ reveals what kind of nonlinear function should be used, as shown in Fig.2. Although the
NDC varies in different domains, we can concentrate on a class of sigmoid functions in FP-1 by choosing the
coefficients $a_j, b_j, c_j, d_j$ in formula (6) in the jth neuron. Since the domain discussed in FP-1 is
$x_i$ and $y_i \in [-1, 1)$, we define the sigmoid function as

$$g_j = y^{nl}_{ji} = \frac{c_j}{1 + \exp(-(y^{l}_{ji} + a_j)/b_j)} + d_j, \quad i = 1, \dots, r, \quad j = 1, \dots, m \qquad (6)$$


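where $y^{l}_{ji}$ is the jth neuron output of the linear approximation in the second layer, and $y^{nl}_{ji}$
is the jth output of the nonlinear function $g_j$ in the second layer. The sum of squared errors is

$$E = \frac{1}{2} \sum_{i=1}^{r} (y_{ji} - y^{nl}_{ji})^2, \quad j = 1, \dots, m \qquad (7)$$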

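The partial derivatives of E with respect to the variables $a_j, b_j, c_j, d_j$ are

$$\frac{\partial E}{\partial a_j} = -\sum_{i=1}^{r} \varepsilon_{ji}\, \frac{c_j}{b_j}\, \frac{\exp(-(y^{l}_{ji} + a_j)/b_j)}{(1 + \exp(-(y^{l}_{ji} + a_j)/b_j))^2}, \quad j = 1, \dots, m \qquad (8)$$

$$\frac{\partial E}{\partial b_j} = \sum_{i=1}^{r} \varepsilon_{ji}\, (y^{l}_{ji} + a_j)\, \frac{c_j \exp(-(y^{l}_{ji} + a_j)/b_j)}{b_j^2\, (1 + \exp(-(y^{l}_{ji} + a_j)/b_j))^2}, \quad j = 1, \dots, m \qquad (9)$$

$$\frac{\partial E}{\partial c_j} = -\sum_{i=1}^{r} \varepsilon_{ji}\, \frac{1}{1 + \exp(-(y^{l}_{ji} + a_j)/b_j)}, \quad j = 1, \dots, m \qquad (10)$$

$$\frac{\partial E}{\partial d_j} = -\sum_{i=1}^{r} \varepsilon_{ji}, \quad j = 1, \dots, m \qquad (11)$$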

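where $\varepsilon_{ji} = y_{ji} - y^{nl}_{ji}$ and $y_{ji}$ is the desired output in the jth node. The algorithm
for adapting $a_j(t), b_j(t), c_j(t), d_j(t)$ is then given by

$$a_j(t) = a_j(t-1) + \mu\Big(-\frac{\partial E}{\partial a_j(t-1)}\Big) \qquad (12)$$

$$b_j(t) = b_j(t-1) + \mu\Big(-\frac{\partial E}{\partial b_j(t-1)}\Big) \qquad (13)$$

$$c_j(t) = c_j(t-1) + \mu\Big(-\frac{\partial E}{\partial c_j(t-1)}\Big) \qquad (14)$$

$$d_j(t) = d_j(t-1) + \mu\Big(-\frac{\partial E}{\partial d_j(t-1)}\Big) \qquad (15)$$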
Fig.3  shows  nonlinear  distributive  chart(NDC)  and  Tab.2  illustrates  the  values  after  nonlinear  approximation 
in  mapping  1.  The  squared  error  is  reduced  from  0.6645  of  linear  mapping  to  0.5505  of  nonlinear  mapping,  with 


target        -.15    -.38    -.75    -.21     .09     .18     .45     .53     .8     -.52
app. result   -.383   -.557   -.223   -.321    .244   -.001    .237    .642    .669   -.313
error          .233    .177   -.527    .111   -.154    .181    .213   -.112    .131   -.207
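
The adaptation of the sigmoid coefficients can be sketched as plain gradient descent on the squared error. This is a hedged illustration, not the authors' code: it assumes the sigmoid has the form c / (1 + exp(-(y + a) / b)) + d, and the starting point (a, b, c, d) = (0, 1, 2, -1), the step size `mu`, the iteration count and all function names are hypothetical. The data in the usage example are the targets and linear outputs listed in Tab.1; the sketch makes no attempt to reproduce the error value reported in the text.

```python
import numpy as np

def g(y_lin, a, b, c, d):
    """Assumed four-parameter sigmoid applied to the linear output."""
    return c / (1.0 + np.exp(-(y_lin + a) / b)) + d

def half_sse(y_lin, y, a, b, c, d):
    """E = (1/2) * sum of squared errors between target and sigmoid output."""
    return 0.5 * float(np.sum((y - g(y_lin, a, b, c, d)) ** 2))

def gradients(y_lin, y, a, b, c, d):
    """Partial derivatives of E with respect to a, b, c and d."""
    s = np.exp(-(y_lin + a) / b)
    eps = y - g(y_lin, a, b, c, d)
    dE_da = -np.sum(eps * (c / b) * s / (1.0 + s) ** 2)
    dE_db = np.sum(eps * c * (y_lin + a) * s / (b ** 2 * (1.0 + s) ** 2))
    dE_dc = -np.sum(eps / (1.0 + s))
    dE_dd = -np.sum(eps)
    return np.array([dE_da, dE_db, dE_dc, dE_dd])

def fit_sigmoid(y_lin, y, p0, mu=1e-3, n_iter=200):
    """Repeat p(t) = p(t-1) - mu * dE/dp and record E along the way."""
    p = np.array(p0, dtype=float)
    history = [half_sse(y_lin, y, *p)]
    for _ in range(n_iter):
        p = p - mu * gradients(y_lin, y, *p)
        history.append(half_sse(y_lin, y, *p))
    return p, history

# Targets and linear outputs taken from Tab.1.
y = np.array([-.15, -.38, -.75, -.21, .09, .18, .45, .53, .8, -.52])
y_lin = np.array([-.238, -.419, -.095, -.18, .286, .087, .28, .674, .708, -.173])
p, history = fit_sigmoid(y_lin, y, [0.0, 1.0, 2.0, -1.0])
print(history[0], history[-1])
```

With a small step size the error decreases from one iteration to the next; a larger `mu` converges faster but can overshoot, in particular in the slope parameter b.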

2.3  Construction of the third layer and determination of the threshold nonlinear function
If the two-layer net cannot reach the given precision, the third layer should be established.

Definition 2.2: Let $w_{ji}$ be the weights between the second and third layers, with $w_{ji} = 1$ if
$i = j$. The connections with $w_{jj} = 1$ from the second to the third layer are denoted main
information channels.
One of the important features of FP is the concept of the main information channels, which is much
different from a general network. The main information channels carry forward the previous achievements.
The remaining errors are considered as the target set to be approximated in the third layer. Since the
error set should be divided into several subsets which are easily eliminated, based on the idea of FP, the
notion of the subthird layer is introduced here.

Definition 2.3: Assume there are m output nodes in the third layer for the given mapping $R^n \to R^m$. If
there are neurons between the second and third layers, each of which has m - 1 weights $w_{ji}$
(j = 1, ..., m; i = 1, ..., m - 1) and a nonlinear threshold function $g_j$, these neurons are denoted
subthird neurons.

The purpose of the subthird layer is to focus on and erase a specified error subset, which is local-area processing. The subthird layer is constructed as shown in Fig.4, where the dark circles represent subthird neurons. The approximation results of the subthird and hidden neurons are summed in the third layer to reach a global approximation, and each of them focuses on a specified error subset. The basic idea of FP is to separate the error set into several subsets and, according to the given accuracy of approximation, to erase the error subsets one by one with the subthird and hidden neurons until the desired precision is reached. Assume there are m output neurons and r mapping pairs.

Definition 2.4: Let $\sigma_j = [e_{j1}, e_{j2}, ..., e_{jr}]$ be the error set in the jth neuron of the second layer, which is transferred to the third layer through the main information channel. If $|e_{j1}| \ge |e_{j2}| \ge \cdots \ge |e_{jr}|$, the set $\sigma_j$ is called the absolute partial order error set.
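The ordering in Definition 2.4 amounts to sorting the errors by decreasing absolute value. The following is a minimal sketch of that step, not the authors' code; the function name `absolute_partial_order` is hypothetical, and the input is the set of ten errors of mapping 1 listed in an arbitrary order.

```python
def absolute_partial_order(errors):
    """Return the errors sorted by decreasing absolute value (Definition 2.4)."""
    return sorted(errors, key=abs, reverse=True)


# The ten errors of mapping 1, listed here in an arbitrary order.
errors = [.2328, .1767, -.5274, .1108, -.1542, .1813, .213, -.1115, .1311, -.2074]
sigma = absolute_partial_order(errors)
print(sigma)
```

The first m - 1 entries of the sorted list are the ones the subthird neuron is asked to erase first.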

After the approximation in the second layer, the whole error set is transferred to the third-layer neurons and rearranged into the absolute partial order error set. Since one connection is the main information channel, whose weight equals one, m - 1 weights need to be determined in every neuron of the subthird layer. The error subset $\sigma_j^{m-1} = [e_{j1}, ..., e_{j(m-1)}]^T$, consisting of the elements of $\sigma_j$ with the largest absolute values, is selected to be reduced first in the jth node of the third layer. Meanwhile, there must be a subset $Y_j^{m-1} = [y_{ji}^{m-1}]$ (j = 1, ..., m; i = 1, ..., m - 1) in the second layer corresponding to $\sigma_j^{m-1}$ in the subthird layer, and then we obtain $W_j^{m-1} = [Y_j^{m-1}]^{-1} \sigma_j^{m-1}$, where $W_j^{m-1}$ is the weight vector from the second layer to the jth neuron of the subthird layer, so that there are only m - 1 variables. On the other hand, the threshold function should be decided by the subset $\sigma_j^{m-1}$. There are three kinds of possible distributions in the subset $\sigma_j^{m-1}$. Let $a_1$ be the negative minimum, $b_1$ the negative maximum, $a_2$ the positive minimum, and $b_2$ the positive maximum in the subset $\sigma_j^{m-1}$. Figs.5-7 demonstrate the three kinds of distributions.

Definition 2.5: Let the error set in the jth neuron of the third layer be $\sigma_j = [e_{j1}, ..., e_{jr}]$, which is an absolute partial order set. Choosing the first m - 1 elements to form the subset $\sigma_j^{m-1} = [e_{j1}, ..., e_{j(m-1)}]$, there must exist one of the distributions shown in Figs.5-7. The dark areas in these figures are called FP-1 areas and are denoted $A_j$ in the jth neuron.

The FP-1 areas $A_j$, in fact, define the active intervals of the nonlinear threshold function. This definition thus gives the way to decide the threshold function.
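As an illustration of how the FP-1 areas could be derived from an error subset, the sketch below computes one interval for the negative elements and one for the positive elements, each widened by a small margin. The function name `fp1_areas`, the margin parameter `eps`, and the reading of the three distributions as "both signs present", "negative only" and "positive only" are assumptions made for this example.

```python
def fp1_areas(subset, eps=1e-3):
    """Return the FP-1 intervals [(a1, b1), (a2, b2)] for an error subset.

    Assumption: each interval spans the negative (or positive) elements,
    widened by eps on both sides; a sign that does not occur in the subset
    contributes no interval.
    """
    neg = [e for e in subset if e < 0]
    pos = [e for e in subset if e > 0]
    areas = []
    if neg:
        areas.append((min(neg) - eps, max(neg) + eps))
    if pos:
        areas.append((min(pos) - eps, max(pos) + eps))
    return areas


# The four largest errors of mapping 1.
areas = fp1_areas([-.5274, .2328, .213, -.2074], eps=.001)
print(areas)
```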


For example, the absolute partial order error set of mapping 1 is shown in Tab.3.

| $e_{11}$ | $e_{12}$ | $e_{13}$ | $e_{14}$ | $e_{15}$ | $e_{16}$ | $e_{17}$ | $e_{18}$ | $e_{19}$ | $e_{1(10)}$ |
| -.5274 | .2328 | .213 | -.2074 | .1813 | .1767 | -.1542 | .1311 | -.1115 | .1108 |

The subset $\sigma_1^{m-1}$ is [-.5274, .2328, .213, -.2074], which is case 1 shown in Fig.5. Now there exist FP-1 areas $a_1 = -.5274 - \epsilon$, $b_1 = -.2074 + \epsilon$, $a_2 = .213 - \epsilon$, $b_2 = .2328 + \epsilon$ ($0 < \epsilon \ll 1$). The threshold function of the jth neuron in the subthird layer is

$$y_{ji}^{m-1} = \begin{cases} \sum_{l=1}^{m-1} w_{jl}^{m-1}\, y_{li}, & a_1 < y_{ji}^{m-1} < b_1 \ \text{or} \ a_2 < y_{ji}^{m-1} < b_2, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, r, \tag{16}$$

where $y_{ji}^{m-1}$ is the output of the jth neuron in the subthird layer (j = 1, ..., m; i = 1, ..., r), which is sifted by the threshold function $g_j$. It is obvious that this nonlinear threshold function is defined by the FP-1 areas, which are decided by $\sigma_j^{m-1}$. For convenience of discussion we use $A_j$ to represent the threshold function in the jth neuron of the subthird layer. This threshold function is able to erase the error subset $\sigma_j^{m-1}$ provided that the mapped values of the other elements $e_{ji} \notin \sigma_j^{m-1}$ do not fall into the FP-1 areas. Because $W_j^{m-1}$ is derived from $Y_j^{m-1}$ and $\sigma_j^{m-1}$, in the output set $Y_j^{m-1} = [y_{j1}^{m-1}, ..., y_{jr}^{m-1}]$ there must be m - 1 elements $y_{jl}^{m-1}$ (l = 1, ..., m - 1) $\in \sigma_j^{m-1}$. If the other elements $y_{jk}^{m-1}$ (k = 1, ..., r - m + 1) $\notin \sigma_j^{m-1}$ do not get into the $A_j$ decided by $\sigma_j^{m-1}$, this threshold function can erase the subset $\sigma_j^{m-1}$; otherwise, a more complicated mapping space appears, which will be discussed in FP-2.
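The gating behaviour of the threshold function can be sketched as follows: a value passes through unchanged if it lies strictly inside one of the FP-1 intervals and is set to zero otherwise. This is a minimal illustration under that reading of (16); the function name `threshold` is hypothetical, and the intervals are those of mapping 1 with a margin of .001.

```python
def threshold(value, areas):
    """Pass value through if it lies strictly inside any FP-1 interval, else return 0."""
    for low, high in areas:
        if low < value < high:
            return value
    return 0.0


# FP-1 intervals of mapping 1 with a margin of .001.
areas = [(-.5284, -.2064), (.212, .2338)]

# Two selected errors pass; two values outside the intervals are suppressed.
outputs = [threshold(v, areas) for v in (-.5274, .2328, 1.2024, -18.0361)]
print(outputs)
```

Because the suppressed values contribute nothing to the third layer, only the selected error subset is cancelled there.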
Definition 2.6: Let the mapping $R^n \to R^m$ (i = 1, ..., r; m - 1 < r) be $Y_j \to \hat\sigma_j = [e_{j1}, ..., e_{j(m-1)}, e_{jm}, ..., e_{jr}]^T$ from the second to the subthird layer, and let $\sigma_j = [e_{j1}, ..., e_{jr}]$ be the absolute partial order error set in the jth neuron of the third layer. If there exists a subset $\sigma_j^{m-1} = [e_{j1}, ..., e_{j(m-1)}]^T$ (m - 1 < r, $e_{ji} \in \sigma_j$), which consists of the first m - 1 elements of $\sigma_j$, with corresponding subset $Y_j^{m-1} = [y_{ji}^{m-1}]$ ($y_{ji}^{m-1} \in Y_j$), then there must exist FP-1 areas $A_j$ and $W_j^{m-1} = [Y_j^{m-1}]^{-1} \sigma_j^{m-1}$ ($W_j^{m-1} \ne 0$). The linear mapping $\hat\sigma_j = Y_j W_j^{m-1}$ is obtained, for which $\sigma_j^{m-1} \subset \hat\sigma_j$, and there is also a subset $\tilde\sigma_j = \hat\sigma_j - \sigma_j^{m-1} = [e_{jm}, ..., e_{jr}]$. If the subset $\tilde\sigma_j$ settles outside the FP-1 areas $A_j$, the linear mapping $(\hat\sigma_j, Y_j, W_j^{m-1}, A_j)$ is called an FP-1 local mapping space and is denoted $\Psi$; $\hat\sigma_j$ is called the checking set and $\tilde\sigma_j$ the generated subset.

An important type of mapping space is described in Definition 2.6, in which the error subset $\sigma_j^{m-1}$ can be totally eliminated by (16) while the other elements $e_{ji} \notin \sigma_j^{m-1}$ are not influenced. The checking set $\hat\sigma_j$ is used to find out whether the given mapping is an FP-1 local mapping space. More importantly, for the local FP-1 mapping space $\Psi$ the complexity of the algorithm relates only to the inversion of m - 1 vectors and the decision of $A_j$ instead of a search in state space; this is a non-recursive and analytic approach.

Let us look at the example of mapping 1 again. Assume m = 5, m - 1 = 4, j = 1, n = 5 and r = 10. The inverse $[Y_1^{m-1}]^{-1}$ corresponding to $\sigma_1^{m-1}$ in the first neuron of the subthird layer is given in (17). From the absolute partial order error set in Tab.3, the error subset is $\sigma_1^{m-1} = (-.5274, .2328, .213, -.2074)^T$.

$$[Y_1^{m-1}]^{-1} = \begin{bmatrix} -.4 & -.25 & .19 & .58 \\ .14 & -.25 & .28 & .18 \\ .49 & .81 & .22 & -.01 \\ .54 & .82 & .23 & -.02 \end{bmatrix}^{-1} = \begin{bmatrix} 1.2856 & -1.8366 & -42.8905 & 42.1995 \\ .0538 & -.6255 & 10.488 & -9.3149 \\ -2.8985 & 6.231 & 59.1355 & -57.6946 \\ 3.5834 & -3.5774 & -44.5292 & 43.988 \end{bmatrix} \tag{17}$$

$W_1^{m-1} = [Y_1^{m-1}]^{-1} \sigma_1^{m-1} = [-18.9949, 3.9922, 27.6068, -21.332]$. If $\epsilon = .001$, the FP-1 areas $A_1$ are

$a_1 = -.5284$, $b_1 = -.2064$, $a_2 = .212$, $b_2 = .2338$.
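The weight vector in this example is the solution of a 4 x 4 linear system, so it can be checked numerically. The sketch below is one way to do that and is not taken from the paper: `solve` is a hypothetical helper implementing Gaussian elimination with partial pivoting, and the matrix entries are those read from (17).

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


# Second-layer outputs for the four selected mapping pairs, as read from (17).
Y = [[-.40, -.25, .19, .58],
     [.14, -.25, .28, .18],
     [.49, .81, .22, -.01],
     [.54, .82, .23, -.02]]
sigma = [-.5274, .2328, .213, -.2074]

W = solve(Y, sigma)
# Residual of Y * W - sigma; it should vanish up to rounding.
residual = [sum(y * w for y, w in zip(row, W)) - s for row, s in zip(Y, sigma)]
print(W)
```

Solving the system directly avoids forming the inverse explicitly, which matters here because the last two rows of the matrix are nearly identical.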

Calculating the checking set gives $\hat\sigma_1 = Y_1 W_1^{m-1}$ = (-.5274, .2328, .213, -.2074, 1.2024, .4832, -18.0361, 6.4508, -.6905), where the generated subset is $\tilde\sigma_1$ = (1.2024, .4832, -18.0361, 6.4508, -.6905) and no generated elements $e_{1i} \in \tilde\sigma_1$ fall into the FP-1 areas. The result of adding the threshold function to the global entity in the third layer is given in Tab.4.


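Tab.4

| target | -.75 | -.15 | .45 | -.52 | .18 | -.38 | .09 | .8 | .53 | -.21 |
| app. result | -.75 | -.15 | .45 | -.52 | -.0013 | -.5567 | .2442 | .6689 | .6415 | -.3208 |
| error | .0 | .0 | .0 | .0 | .1813 | .1767 | -.1542 | .1311 | -.1115 | .1108 |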


The total squared error is .1298. The NDC is shown in Fig.8. In this figure the degree of nonlinearity has changed: the curve has become closer to a straight line and smoother. Hence, the nonlinear threshold function of the subthird neuron contributes much to reducing the nonlinearity and the errors of the global approximation. In the instance of mapping 1, the $W_1^{m-1}$ obtained from $Y_1^{m-1}$ and $\sigma_1^{m-1}$ results in no generated elements $e_{1i} \in \tilde\sigma_1$ getting into the FP-1 areas, so this is an FP-1 local mapping space. If the primary local mapping is not an FP-1 local mapping space, some generated elements $e_{ji} \in \tilde\sigma_j$ that get into the FP-1 areas can affect the previously established approximation. In other words, the unexpected $e_{ji} \in \tilde\sigma_j$ that settle in the FP-1 areas make the corresponding errors $e_{ji}$ in the third layer increase. The strategy used in the FP-1 algorithm is to compare the absolute values of the deleted errors with the increased errors. If $(1/2) \sum_i |e_{ji}^{m-1}| > \sum_u |e_{ju}|$, the algorithm of the FP-1 local mapping space remains applicable. The notion of a half FP-1 local mapping space is therefore introduced.
Definition 2.7: Let the mapping $R^n \to R^m$ (i = 1, ..., r; m - 1 < r) be a non FP-1 local mapping space. After implementing the linear mapping $(\hat\sigma_j, Y_j, W_j^{m-1}, A_j)$ of Definition 2.6, there must exist a subset $\sigma_j^{m-1} = [e_{j1}, ..., e_{j(m-1)}]$ (m - 1 < r, $e_{ji} \in \sigma_j$) and a generated subset $\tilde\sigma_j = [e_{j1}, ..., e_{jt}]$ ($e_{jt} \in \tilde\sigma_j$), of which some elements $e_{ju}$ get into the FP-1 areas $A_j$. If $(1/2) \sum_i |e_{ji}^{m-1}| > \sum_u |e_{ju}|$, the given mapping $R^n \to R^m$ is called a half FP-1 local mapping space and is denoted $\tilde\Psi$.

The concept of the half FP-1 local mapping space, in fact, provides a looser condition for implementing the FP-1 algorithm. The only difference between $\Psi$ and $\tilde\Psi$ is that $\tilde\Psi$ reduces the errors less than $\Psi$ does. From a mathematical point of view, $\tilde\Psi$ reveals a more intricate topological mapping relationship. Up to now the two types of primary local mapping spaces, $\Psi$ and $\tilde\Psi$, have been defined and the corresponding algorithms provided. FP-1 is based on these two primary local mapping spaces. Other, more sophisticated local mapping spaces will be introduced in the future discussion of FP-2.

2.4  Determination  of  hidden  neurons 

If the process above cannot reach the desired accuracy, hidden neurons should be added.

Definition 2.8: If a hidden neuron is connected to only one output neuron and the corresponding weight equals 1, this hidden neuron is called a direct hidden neuron (DHN).

From the basic idea of FP-1, the hidden neuron, in fact, contributes only to the local area for the specified output node in the process of approximation. The direct hidden neuron actually transfers information to the connected output node. Let us use mapping 1 to demonstrate the algorithm in the hidden layer.

1. The remaining errors in Tab.4 should be rebuilt into the absolute partial order error subset $\sigma_j$; then the first n maximum absolute values in $\sigma_j$ are selected as the error subset $\sigma_j^{h1}$ to be reduced through the hidden neuron. Here $\sigma_j^{h1}$ denotes the error subset eliminated by the first hidden neuron linked to the jth output node. A vector $W_j^{h1}$ with n variables needs to be decided. Tab.5 shows the rebuilt absolute partial order error set, with n = 5 in mapping 1.


Tab.5

| target | .18 | -.38 | .09 | .8 | .53 | -.21 | -.75 | -.15 | .45 | -.52 |
| app. result | -.0013 | -.5567 | .2442 | .6689 | .6415 | -.3208 | -.75 | -.15 | .45 | -.52 |
| error | .1813 | .1767 | -.1542 | .1311 | -.1115 | .1108 | .0 | .0 | .0 | .0 |


In Tab.5, the error subset is $\sigma_j^{h1} = [.1813, .1767, -.1542, .1311, -.1115]^T$. The FP-1 areas $A_j^h$ are $a_1 = -.1543$, $b_1 = -.1114$, $a_2 = .131$, $b_2 = .1814$.

2. Use $W_j^{h1} = X_{h1}^{-1} \sigma_j^{h1}$ to set $W_j^{h1}$, the weight vector between the input and the first hidden neuron, where $X_{h1} \in X$ is the input subset that corresponds to $\sigma_j^{h1}$. With the same algorithm as in the subthird layer, the inverse $X_{h1}^{-1}$ and $W_j^{h1}$ can be obtained: $W_j^{h1} = (.5788, .1725, 1.9783, .6433, -2.6756)$.

3. Use $\hat\sigma_j^{h1} = X W_j^{h1}$ to check whether other vectors get into the FP-1 areas $A_j^h$. If not, it is a local FP-1 mapping space; here $\hat\sigma_j^{h1}$ is the checking set of the first hidden neuron in the first output node.

$\hat\sigma_j^{h1} = [.1813, .1767, -.1542, .1311, -.1115, -.2239, -.7742, -1.0811, .4735, 1.9475]^T$. No generated elements $e_{ji} \in \hat\sigma_j^{h1} - \sigma_j^{h1}$ get into $A_j^h$. Therefore, this is a local FP-1 mapping space, and the nonlinear threshold function is established from the hidden neuron to the jth output node:

$$y_{ji}^{h1} = \begin{cases} \sum_{k=1}^{n} w_{jk}^{h1}\, x_{ki}, & a_1 < y_{ji}^{h1} < b_1 \ \text{or} \ a_2 < y_{ji}^{h1} < b_2, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, r. \tag{18}$$


Because it is an FP-1 mapping space, this nonlinear threshold function allows only the subset $\sigma_j^{h1}$ to pass through the hidden neuron and eliminate the error subset in the third layer. The result of the approximation in the jth output node is then as shown in Tab.6.

Tab.6

| target | .18 | -.38 | .09 | .8 | .53 | -.21 | -.75 | -.15 | .45 | -.52 |
| app. result | .18 | -.38 | .09 | .8 | .53 | -.3208 | -.75 | -.15 | .45 | -.52 |
| error | .0 | .0 | .0 | .0 | .0 | .1108 | .0 | .0 | .0 | .0 |

The total squared error is .0123. The NDC is shown in Fig.9. In general, this method of adding a hidden neuron can be repeated again and again until the given precision is reached. Fig.10 shows the architecture of the mapping 1 neural network; the grey circle is the threshold function.
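The two total squared errors quoted for mapping 1 can be reproduced directly from the target and approximation rows of the tables. The short check below is an illustration only; the function and variable names are hypothetical, and the values are those listed for mapping 1 after the subthird neuron and after the first hidden neuron.

```python
def total_squared_error(targets, outputs):
    """Sum of squared differences between targets and approximation results."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


targets = [.18, -.38, .09, .8, .53, -.21, -.75, -.15, .45, -.52]
# Approximation after the subthird neuron (before any hidden neuron is added).
after_subthird = [-.0013, -.5567, .2442, .6689, .6415, -.3208, -.75, -.15, .45, -.52]
# Approximation after the first direct hidden neuron is added.
after_hidden = [.18, -.38, .09, .8, .53, -.3208, -.75, -.15, .45, -.52]

error_before = total_squared_error(targets, after_subthird)
error_after = total_squared_error(targets, after_hidden)
print(round(error_before, 4), round(error_after, 4))
```

Rounded to four decimals, the two sums agree with the values .1298 and .0123 stated in the text.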

Definition 2.9: With the given approximation precision, if all of the local mappings from the second to the third layer and all of the direct hidden neurons are FP-1 local mapping spaces or half FP-1 local mapping spaces, the given mapping $R^n \to R^m$ (i = 1, ..., r; n < r) is called a global FP-1 mapping space and is denoted $\Pi$.

If the given mapping $R^n \to R^m$ is a global FP-1 mapping space $\Pi$, we can know and even control the accuracy of approximation at every step, and therefore we are able to control the entire course of the approximation.


3  Remarks  and  Conclusion 

Due to the page limit, we cannot provide more simulations; mapping 1 is only a simple example. What is the meaning of the FP-1 algorithm in mathematics? Although BP is widely used, one important problem has not been given much attention by researchers, namely whether the configuration of the general multilayer perceptron is reasonable. In other words, connecting every neuron with every neuron in the next layer is unreasonable. This is a key problem. In fact, an arbitrary nonlinear mapping can be viewed as a global mapping together with its local constituents; that is, some neurons play their roles in the global region, while others focus on particular local areas, depending on the given mapping data set or application. The main idea of the FP series of algorithms is how to decompose the given mapping into global and local mappings and to decide which neurons focus on which areas. Undoubtedly, many questions will arise with the emergence of FP-1. Some of them need to be proved mathematically. The FP series of algorithms puts the emphasis on how to realize them.


Fig.1  The mapping of $R^n \to R$ in a two-layer network

Fig.2  Linear approximation result of two-layer nets of mapping 1

Fig.3  Linear and nonlinear approximation of two-layer nets of mapping 1

Fig.4  Nonlinear distribution chart in third layer of mapping 1

Fig.7  The distribution of threshold function in case 3

Fig.9  The nonlinear distribution chart after adding a hidden neuron

Fig.10  The architecture of adding a hidden neuron in three-layer nets
ABSTRACT

All artificial neural net (ANN) learning schemes assume the existence of an untrained pool of neurons whose weights are modified, so that after training, the net of which they are a part may solve some problem, pattern recognition for example. But biologically, this untrained pool of neurons seems a contradiction. In a living organism, these untrained neurons serve no survival purpose prior to their training. It is difficult therefore to explain their presence. It seems highly unlikely that a single mutation simultaneously produced this untrained pool and its training rule. How did both the untrained pool and the training rule evolve? I discuss the applicability of a two-layered net which employs probability data about the environment as weights. This net (or a similar one) might be an appropriate tool to help answer the question posed above. I have seen no publications in the ANN literature that address the problem of how learning could have evolved.


I.  Introduction 


A. History. Reggia [14] explored connectionist models which employed "virtual" lateral inhibition, and included the activation of the receiving node in the equations for the flow of activation. Ahuja [2] extended these concepts to include summing the total excitatory and inhibitory flow into a node. He thus introduced the concept that the change of activation of a node depends on the integral of the flow into that node and not just on the present activation levels of the nodes to which it is connected. Both Reggia's and Ahuja's models used probability data for the weights. Ahuja's model was further extended by Alexander [4], [5] in the RX model to allow both the weights and the activations of Ahuja's model to be negative; further, Alexander's model included the prior probabilities of all nodes. Although these equations have been discussed elsewhere [4] [5] [6], they are discussed here in light of the new use for which they are proposed in this article.

B. Overview. Section II of this paper contains a complete listing of the RX equations and describes their development. The convergence of the system is discussed in Section III. Section IV briefly describes the experiments testing the RX system. Section V offers a conjecture on a biologically plausible answer to the question raised above by employing the RX equations, and summarizes and concludes this article.

II. The RX Equations


A. Details of the net and its usage. The net is two-layered, with the J lower level nodes being the inputs to the N upper level nodes. The processing units depict a "local" or one-unit-one-concept representation. Thus far, the net has been used in pattern recognition applications, with the lower level nodes serving as features and the upper level nodes representing the possible pattern classes. The values of the upper level nodes are on [0,1] and are called $a_i(t)$. The prior probability of the ith node's existence is called $\bar a_i$. Values of $a_i(t)$ greater than $\bar a_i$ indicate a higher than average chance of occurrence of the concept represented by node i, and those lower indicate a less than average chance. The lower level nodes (called $m_j$) are on [-1,1]. Let the range of possible input values to a node be between some $Min_j$ and $Max_j$. Call the average $Ave_j$, and let $Observ_j$ be the observed value. Then define:

$$m_j = \frac{Observ_j - Ave_j}{Max_j - Ave_j} \qquad \text{if } Observ_j \ge Ave_j \tag{1}$$

$$m_j = \frac{Observ_j - Ave_j}{Ave_j - Min_j} \qquad \text{otherwise} \tag{2}$$

When $m_j$ assumes the value 1, the feature represented by $m_j$ is present with probability one. When $m_j$ is -1, the feature is absent with probability one. A value of zero indicates that no information exists concerning the absence or presence of the feature ($m_j$).
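The input scaling of (1) and (2) maps an observation onto [-1, 1], with 0 at the average. The following minimal sketch assumes Min < Ave < Max; the function name `normalize_input` and its parameter names are hypothetical.

```python
def normalize_input(observed, minimum, maximum, average):
    """Map an observed value onto [-1, 1] around the node's average, as in (1)-(2).

    Assumes minimum < average < maximum.
    """
    if observed >= average:
        # Above-average observations are scaled by the distance to the maximum.
        return (observed - average) / (maximum - average)
    # Below-average observations are scaled by the distance to the minimum.
    return (observed - average) / (average - minimum)


# A feature ranging over [0, 10] with an average of 4.
values = [normalize_input(x, 0, 10, 4) for x in (0, 4, 7, 10)]
print(values)
```

The two branches use different denominators, so an average that is not midway between the extremes still maps the extremes to exactly -1 and 1.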

Two sets of weights exist. The first is called $w_{ij}$, and in absolute value indicates the probability of occurrence, if positive, and non-occurrence, if negative, of the upper level node (the ith node), given the existence of the lower level node (the jth node). The second weight is called $v_{ij}$, and in absolute value indicates the probability of occurrence, if positive, and non-occurrence, if negative, of the lower level node (the jth node), given the existence of the upper level node (the ith node).

Two auxiliary functions are associated with each $a_i(t)$. The function which conveys the excitatory activation is called $Exc_i(t)$, and the one which conveys the inhibitory activation is called $Inh_i(t)$. They are the sums of all the excitatory and inhibitory flow of activation into node $a_i(t)$. The strategy used in defining these functions is to build them as bounded monotone-increasing functions, defined by their derivatives. The monotone-increasing characteristic of the $Exc_i(t)$ term is achieved by defining its derivative as a product of strictly positive terms. One of these terms is a function of all the $a_i(t)$ and is the variable forcing term. For the present it will be called ForcingFunction, and its equations are presented shortly. There is one such function for each lower level (j) node, hence a sum must be taken over all the j nodes connected to any given i node. The bounded characteristic is achieved by including in the equation a factor of the form $[N - Exc_i(t)]$. The choice of N as the bound which $Exc_i(t)$ approaches is somewhat arbitrary. The differential equation defining $Exc_i(t)$ is then:

$$\frac{d}{dt} Exc_i(t) = K_1\, \bar a_i\, [N - Exc_i(t)] \sum_j ForcingFunction_j \tag{3}$$

where $K_1$ is a constant of proportionality. ForcingFunction is a function of all the $a_i(t)$. The $\bar a_i$ term as a factor reflects the prior probability of the concept represented by $a_i(t)$ in a meaningful way. $Inh_i(t)$ is defined in a similar fashion.


Since $m_j$ may be above or below average (positive or negative) and the weights $w_{ij}$ may each be either positive or negative, four separate cases may occur. Only four cases can occur, since the signs of the $v_{ij}$ are the same as those of the $w_{ij}$. These cases are: (1) weight positive with activation positive; (2) weight negative with activation positive; (3) weight positive with activation negative; and (4) both weight and activation negative. As will be shown, all activation transfers (both excitatory and inhibitory) are calculated as positive, with the inhibitory being subtracted from the excitatory to calculate the total for each case used by the equations in calculating the $a_i(t)$. Each of the four cases has a term calculated using the $w_{ij}$ weights (called $OUT_{ij}$), and each of these has an offsetting term of opposite sign calculated with the $v_{ij}$ weights (called $out_{ij}$).

We require two equations per upper level node to describe the change of activation, one when $a_i(t)$ is greater than $\bar a_i$, and one when it is less. Thus for $a_i(t) > \bar a_i$:

$$\frac{d}{dt} a_i(t) = K_3\, c_1\, [1 - a_i(t)]\, [Exc_i(t) - Inh_i(t)] \tag{4}$$

When $a_i(t) \le \bar a_i$:

$$\frac{d}{dt} a_i(t) = K_3\, c_2\, a_i(t)\, [Exc_i(t) - Inh_i(t)] \tag{5}$$

Above, $K_3$ is a constant of proportionality, and $c_1$ and $c_2$ are needed to keep the derivative continuous at $\bar a_i$. The values of $c_1$ and $c_2$ are:

$c_1 = 1 / (1 - \bar a_i)$,
$c_2 = 1 / \bar a_i$.
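The role of the two constants can be seen by evaluating the right-hand sides of (4) and (5) on either side of the prior. The sketch below is an illustration under the reading of the equations given above, not the RX implementation; the function name `activation_rate` and the sample values of the prior, $Exc_i$ and $Inh_i$ are assumptions.

```python
def activation_rate(a, a_bar, exc, inh, k3=1.0):
    """Right-hand side of (4) or (5) for one upper-level node.

    a     : current activation a_i(t), in [0, 1]
    a_bar : prior probability of the node, strictly between 0 and 1
    exc   : accumulated excitatory flow Exc_i(t)
    inh   : accumulated inhibitory flow Inh_i(t)
    """
    net = exc - inh
    if a > a_bar:
        c1 = 1.0 / (1.0 - a_bar)
        return k3 * c1 * (1.0 - a) * net
    c2 = 1.0 / a_bar
    return k3 * c2 * a * net


a_bar = 0.3
at_prior = activation_rate(a_bar, a_bar, exc=2.0, inh=0.5)
just_above = activation_rate(a_bar + 1e-9, a_bar, exc=2.0, inh=0.5)
print(at_prior, just_above)
```

With these constants both branches reduce to $K_3 [Exc_i(t) - Inh_i(t)]$ at the prior, and the rate vanishes at the bounds 0 and 1, which keeps the activation inside [0, 1].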

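The complete list of equations to be used with the above is now given.

$$\frac{d}{dt} Exc_i(t) = K_1\, \bar a_i\, [N - Exc_i(t)] \sum_j [\,\ldots\,] \tag{6}$$

$$\frac{d}{dt} Inh_i(t) = K_2\, \bar a_i\, [N - Inh_i(t)] \sum_j [\,k_{21}\, outpp_{ij}(t) + k_{22}\, OUTPM_{ij}(t) + k_{23}\, OUTMP_{ij}(t) + k_{24}\, outmm_{ij}(t)\,] \tag{7}$$

Here $k_{11}, \ldots, k_{24}$ are included for generality.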


OUTPP^t) 


*itt, 


WhenW^^^^ 


oumtM- 


OUTUP^t) 


♦HI, 


<wn«f/o-J'**'°f**'°'L  »l»vl 


ou^p^t) 


_!]>„  VMo-«sl 

52K,*ia,(r)-a,| 

WunV^M/Xi 


*mj*\af,t)-a,\ 


mpmjiii 


E  >g*l<»i(0-a,| 

VnienV^^j<Q 


\mj\*\afi)-5,\ 


(8) 


(9) 


(10) 


(11) 


(12) 


(13) 
WheHV^<OjH^


*mj*\a^O-a,\ 


(14) 


OHCmm^O 


Vl..o-?.i 

wiuHV^<o^j<o 


(15) 


As written, with absolute value signs included, only positive addends appear in the summation term, which is itself a multiplicand in the equations for $Exc_i(t)$ and $Inh_i(t)$; hence they are both monotone non-decreasing functions, as required.


We first offer some comments on the RX system of equations described in Section II above. The RX net is more biologically plausible than a perceptron. Consider the following:

a - Unexcited cortical neurons spontaneously fire at some resting average rate [1] [10]. Information is only transmitted when these neurons vary from their average values. This attribute is simulated in the RX equations by having each node possess an average value, at which it is assumed to be operating. Activation will flow only when the activation of the lower level nodes is above or below their averages.

b - A highly excited cortical neuron is more likely to fire upon the arrival of an excitatory action pulse than is one at its resting activation level [1]. This property is simulated in the RX equations by making the new activation level of the receiving node functionally depend on the product of the activations of the sending ($m_j$) and the receiving nodes ($a_i(t)$).

c - The RX system calculates the activation of a receiving node by using the integral of excitatory and inhibitory inputs rather than the instantaneous values of the nodes from which it receives input. It is for this reason, of course, that the system consists of 3*N rather than N equations.

The RX equations handle the lateral inhibitory effects (on-center, off-surround) of neurons in the same layer by "virtual" lateral inhibition [14]: that is, intra-layer inhibitory effects are implemented without actual connections between these nodes. This overcomes two problems: (1) the number of connections grows rapidly with N (the number of nodes on the upper level) and (2) it is difficult to measure (or estimate) these weights.

III. Convergence of the RX Equations. The demonstration of the convergence of the RX equations is given in [4]. While it is shown that these equations are convergent almost everywhere, it could not be shown that they converge in an $\epsilon$-volume around the critical points. To complicate matters even more, the critical points form a continuous set. As is to be expected in such cases, the demonstration of convergence is long and somewhat difficult. Note that the RX set contains 3N equations, i.e. N terms in $a_i(t)$, N terms in $Exc_i(t)$ and N terms in $Inh_i(t)$. The root of the difficulty is that the first N terms of the diagonal of the Jacobian matrix for the RX system are zero (i.e. N, or one third, of the eigenvalues are zero). As yet, the RX equations do not include decay terms. But their inclusion would greatly simplify the proof of convergence since, were they included, all terms of the diagonal of the Jacobian matrix would be negative.

IV. Testing of the RX Equations. Also in [4] the RX equations were coded and this system extensively tested. Testing included (1) general behavior testing, in which their predictions were compared with expected behavior, (2) accuracy testing by comparison with special cases that allowed the equations to be integrated in closed form, (3) comparison with similar runs of the IA (Interactive Activation) system of Rumelhart and McClelland [11], and (4) comparison of the results of a radar identification problem with the backpropagation net [16]. They have continued to undergo other tests of their utility [7]. In all cases the RX equations showed themselves on a par with the nets to which they were being compared.

V. On the evolution of intelligence. Consider again the question raised in the Abstract: how is the existence of a pool of untrained neurons to be explained in a biologically plausible way? The pool of neurons serves no survival purpose prior to its training.

Kuffler, Nichols and Martin [10] imply that the brain is primarily a controller. In the Outline of a Theory of Intelligence, Albus [3] states that "... first and foremost, the brain is a controller." But consider the history of the study of artificial neural networks. From Rosenblatt [15] in Neurodynamics: "At the time the first perceptron was proposed, the writer was primarily concerned with the problem of memory storage in biological systems..." Clearly Minsky and Papert [13] were concerned with memory, since their surface concern was Rosenblatt's book. The recent revival of the connectionist or artificial neural network field started afresh in 1982 with Hopfield's [9] paper "Neural Networks and Physical Systems with Emergent Collective Computational Abilities." This paper discussed how information could be stored in a certain type of net, again a concern about memory. Although the trend is shifting to the study of control mechanisms [17], the initial studies were mainly on memories.

Fossil records suggest that the use of neurons to transmit control signals existed before their use in memories. It could be argued that the use of memories differentiates lower life forms from higher ones. In the lower forms, the input or signals sensed by the receptor neurons are sent directly to the motor neurons. Little or no computation is performed on the incoming signal prior to its transmittal to the motor neuron which causes some action to be performed. On the other hand, in higher forms of life, incoming signals from sensors are sent through sequences of cell assemblies where specific processing is performed before the response is sent to the motor neurons to control actions. Albus [3] has suggested that it is in the specific processing that memories might be needed. Stated alternately, lower forms of life respond to given input signals in a predetermined way: the response is hardwired. In higher forms, the possible effects of various behaviors are examined in view of the existent environment, and the most appropriate behavior is selected and executed.


In [5] the author suggested that the RX system of equations might be used in a control scenario as follows.
Consider the net displayed in Fig. 1. The lower level nodes are labeled m_i and represent the presence (or absence) of
an external event or stimulus to an organism. The upper level nodes represent possible responses by that organism.
Note that m1 is excitatory to a1 and inhibitory to a2, while m2 is excitatory to a2 and inhibitory to a1. Let the
responses be mutually exclusive - for example, the decision to run to escape from a predator or to freeze and hope
not to be observed. The weights w11, w12, w21, and w22, and the activations of m1 and m2 will determine what action
the organism will take. We may ignore the v weights for the present, since they will not change in what follows. As an
alternative to the exercise of a rule to modify the weights (learning), the weights may be determined as follows. We
assume the existence of some statistical distribution of these weights for all new offspring. Assuming that the
organism's biological neural net processes in a similar fashion to the RX net, then an organism possessing the weights
that match most closely the actual survival probabilities will have a higher chance of survival (and hence chance to
reproduce) than will all others. Further, assuming the weight values passed on in the genes will be distributed with a
mean near the value of the parents, as generations go by the mean of the population weights will tend to approach that
which leads to the highest survival probability.

Clearly, the above is highly speculative; however, if such a phenomenon actually exists, perhaps it should be called
"Darwinian learning."
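
The selection argument above can be illustrated with a toy simulation. The sketch below is only a hypothetical illustration and is not taken from [5]: it replaces the RX net by a single scalar weight, assumes an arbitrary target value standing in for "the weights that match most closely the actual survival probabilities", and uses truncation selection (the closer half of the population reproduces) with Gaussian inheritance noise. The function name and all parameter values are assumptions.

```python
import random

def evolve(target=0.7, pop_size=100, generations=40, sigma=0.05, seed=0):
    """Toy selection loop: returns the population mean weight per generation."""
    rng = random.Random(seed)
    # Assumed initial distribution of the weight among the first offspring.
    population = [rng.uniform(-1.0, 1.0) for _ in range(pop_size)]
    means = [sum(population) / pop_size]
    for _ in range(generations):
        # Survival (assumed rule): the half whose weight is closest to the
        # target value gets to reproduce.
        ranked = sorted(population, key=lambda w: abs(w - target))
        survivors = ranked[:pop_size // 2]
        # Inheritance: each survivor has two offspring whose weights are
        # distributed with a mean equal to the parent's weight.
        population = [rng.gauss(parent, sigma)
                      for parent in survivors for _ in range(2)]
        means.append(sum(population) / pop_size)
    return means

means = evolve()
print(f"initial mean {means[0]:.3f}, final mean {means[-1]:.3f}")
```

Under these assumptions the population mean drifts toward the target value without any weight-update rule being applied inside a single organism, which is the point of the argument.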


Now assume that a mutation affords the connection from sensor m3 to one of the action nodes, say a2.
Further, assume that m3 indicates the presence of an event that, in combination with the presence of the event
signified by m2, makes a2 a better choice than a1; i.e., continuing the above example, the ability of the prey to smell
the predator might indicate that the wind is from predator to prey. Hence the predator probably can't smell the
prey. Then, as long as the weight on this new connection is positive, this mutant connection will tend to make the
organism behave in a way that enhances its chance of survival.

In [7] the author studied a fuzzy logic controller designed to determine the appropriate control valve setting
required to maintain water at a constant height in a tank with variable inflow and outflow. In this study, the
defuzzified centroid, which gave the value for the appropriate control valve setting, was subjected to increasing
random noise. The ultimate test in this series was to replace the calculated setting with a positive random number
on the range of control valve settings if the defuzzified setting was positive, otherwise a negative random number
on this range. Even under these extreme conditions the controller still worked.
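
A rough sketch of why such a sign-only controller can still work is given below. It is not the fuzzy controller of [7]: the tank is assumed to be a simple integrator, the "calculated setting" is stood in for by the level error, and the valve range, disturbance range and step size are all hypothetical values chosen for illustration. Only the sign of the calculated setting is kept; its magnitude is replaced by a random number, as in the extreme test described above.

```python
import random

def simulate_tank(setpoint=1.0, steps=500, dt=0.1, seed=0):
    """Water level after `steps` updates under a sign-only random controller."""
    rng = random.Random(seed)
    level = 0.0
    for _ in range(steps):
        # Stand-in for the defuzzified setting: only its sign is used below.
        calculated = setpoint - level
        # Replace the calculated setting by a random number of the same sign
        # on the (assumed) valve range [-1, 1].
        magnitude = rng.uniform(0.0, 1.0)
        valve = magnitude if calculated > 0 else -magnitude
        # Variable inflow and outflow, modelled as a small random disturbance.
        disturbance = rng.uniform(-0.1, 0.1)
        level += dt * (valve + disturbance)
    return level

final_level = simulate_tank()
print(f"final level: {final_level:.3f}")
```

With these assumed ranges the random valve action pushes the level toward the setpoint on average more strongly than the disturbance pushes it away, so the level stays close to the setpoint even though the magnitude of the control action carries no information.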

Summary. From the above, we can build a plausible explanation of how neurons are added to elementary
organisms with small populations of neurons. Although I started with a pool of two in the example above, there
is no reason why I could not have begun with just one neuron. If mutations randomly add connections, some
of these connections will be those which could aid in survival. Assuming that the genes will pass these added
connections down to the next generation with random variations in the weights, and that the RX net is a close model
of actual biological processing, then those organisms lucky enough to inherit weights which contribute to their fitness
to survive will have a better chance of passing, through their genes, these weights on to their offspring. From the
example of control of the height of water in the tank, we see that just a little guidance can help a great deal in
control problems. I suspect that this is true in the case of evolution. I offer the following answer to the question posed
in the abstract. Organisms possessing neurons serving control functions might, by mutation, beneficially combine
the afferent signals of sensory neurons or interneurons in the control paths and thereby improve the survival
probability of the organism. The effect of this mutation is to produce a more highly coupled or connected set of
neurons. This increased connectivity might allow the emergence of memory in addition to the enhanced control
function.

Discussion. The above is highly speculative. "Much of what has been presented is hypothesis and
argument by analogy" [3]. No biological evidence exists for any of the hypotheses presented in Section V. The
reason for examples being demonstrated with the virtually unknown RX net is the author's familiarity with this net
(his dissertation work). Certainly, more biologically plausible models are readily available [12]. Perhaps further
review will reveal that major biological implausibilities exist in the above.

Yet, the author feels there is merit in work such as the above. Daniel Gardner [8], and the editors of the
series of which this book is a member, posit the theme that "Third Generation Neural Networks Should be
Neuromorphic". A step towards biological plausibility seems a step in the right direction.

ABSTRACT This paper discusses properties of activation functions in multilayer neural networks
applied  to  multi-frequency  classification.  A  rule  of  thumb  for  selecting  activation  functions  or  their 
combination  is  proposed.  The  sigmoid,  Gaussian  and  sinusoidal  functions  are  employed  due  to 
their  unique  space  division  properties.  Properties  of  each  function  and  their  combinations  are 
discussed  based  on  the  internal  representation,  that  is  the  distributions  of  the  hidden  unit  inputs 
and  outputs  and  classification  rates  with  and  without  noise.  The  sigmoid  function  is  not  effective 
for  a  single  hidden  unit.  On  the  contrary,  the  other  functions  can  provide  good  performance.  When 
several  hidden  units  are  employed,  the  sigmoid  function  becomes  useful.  However,  the  convergence 
speed  is  still  slower  than  the  others.  The  Gaussian  function  is  sensitive  to  the  additive  noise,  while 
the  others  are  rather  insensitive.  When  noise  is  not  included,  the  Gaussian  function  is  most  useful 
for the convergence rate and the classification accuracy. On the other hand, when additive noise is
included, the sigmoid and sinusoidal functions become more effective. These properties do not
carry over directly to the combinations. However, they still remain, and it is possible to select the
optimum activation function. This selection also depends on the patterns to be classified.

I INTRODUCTION

An advantage of multilayer neural networks (NNs) trained by the back-propagation (BP) algorithm is their ability to extract common properties, features or rules, which can be used to classify data included in several groups [1]. Especially when it is difficult to analyze the common features using conventional methods, supervised learning, using combinations of the known input and output data, becomes very useful.

We studied multi-frequency signal classification using multilayer neural networks [5]-[7]. Since the frequencies are assigned alternately to several groups, it is very difficult to distinguish the waveforms within a short period and with a limited number of samples by conventional methods. The following advantages of the NN over conventional methods were confirmed. The neural network can classify the signals using a small number of samples and a short observation period, for which the Fourier transform cannot classify them. The number of calculations is sufficiently smaller than that of the convolution calculation required in digital filters.

In the previous work, a sigmoid function was used. However, it is not always optimum. Therefore, properties of activation functions are investigated in this paper. For this purpose, some typical functions are taken into account. They include a sigmoid function, a radial basis function [2] and a periodic function. They will be compared with each other in classifying multi-frequency signals. Effects of noisy signals on the training and classification processes will also be discussed.

As a result, a rule of thumb for selecting suitable functions and combinations of several kinds of functions will be provided.

II MULTI-FREQUENCY SIGNALS

Multi-frequency signals are defined by

x_{pm}(n) = \sum_{r=1}^{R} A_{mr} \sin(\omega_{pr} n T + \phi_{mr})    (1)

n = 1 \sim N,  \omega_{pr} = 2\pi f_{pr}

T is a sampling period. M samples of x_{pm}(n), m = 1 ~ M, are included in the group X_p as follows.

X_p = \{ x_{pm}(n), m = 1 \sim M \},  p = 1 \sim P    (2)

In one group, the same frequencies are used.

F_p = [f_{p1}, f_{p2}, ..., f_{pR}] Hz,  p = 1 \sim P    (3)

Amplitude A_{mr} and phase \phi_{mr} are generated as random numbers, uniformly distributed in the following ranges.

0 < A_{mr} < 1,  0 < \phi_{mr} < 2\pi    (4)

III MULTILAYER NEURAL NETWORK

3.1  Network  Structure  and  Equations 

A neural network with a single hidden layer is taken into account. N samples of the signal x_{pm}(n) are applied to the input layer in parallel. The nth input unit receives x_{pm}(n). The connection weight from the nth input to the jth hidden unit is denoted w_{nj}. The input and output of the jth hidden unit are given by

net_j = \sum_{n=0}^{N-1} w_{nj} x_{pm}(n) + \theta_j    (5)

y_j = f_H(net_j)    (6)

Letting the connection weight from the jth hidden unit to the kth output unit be w_{jk}, the input and output of the kth output unit are given by

net_k = \sum_{j=0}^{J-1} w_{jk} y_j + \theta_k    (7)

y_k = f_O(net_k)    (8)

The activation function of the output layer is the sigmoid function.

The number of output units is equal to that of the signal groups P. The neural network is trained so that a single output unit responds to one of the signal groups.
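
The forward computation of such a network can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes one bias term per unit, the conventional logistic form for the sigmoid, and weights stored as nested lists; the names `forward`, `w_hidden`, `b_hidden`, `w_out` and `b_out` are hypothetical.

```python
import math

def logistic(net):
    """Conventional logistic sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, w_hidden, b_hidden, w_out, b_out,
            f_hidden=logistic, f_out=logistic):
    """Forward pass through one hidden layer and one output layer.

    x            : list of N input samples
    w_hidden[j]  : list of N weights into hidden unit j
    w_out[k]     : list of J weights into output unit k
    b_hidden/b_out : assumed bias terms, one per unit
    """
    # Hidden unit input (weighted sum plus bias) and output.
    hidden = [f_hidden(sum(w * xi for w, xi in zip(w_j, x)) + b)
              for w_j, b in zip(w_hidden, b_hidden)]
    # Output unit input (weighted sum plus bias) and output.
    outputs = [f_out(sum(w * yj for w, yj in zip(w_k, hidden)) + b)
               for w_k, b in zip(w_out, b_out)]
    return hidden, outputs

# Tiny example: N = 2 inputs, J = 1 hidden unit, P = 2 output units.
hidden, outputs = forward([1.0, 2.0],
                          w_hidden=[[0.0, 0.0]], b_hidden=[0.0],
                          w_out=[[2.0], [-2.0]], b_out=[-1.0, 1.0])
print(hidden, outputs)
```

Passing a different function as `f_hidden` is how the hidden-layer activation would be swapped while keeping the sigmoid in the output layer.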

3.2  Training  and  Classification 

Signals are categorized into training and untrained sets, denoted X_{Tp} and X_{Up}, respectively. Their elements are expressed by x_{Tpm}(n) and x_{Upm}(n), respectively.

The neural network is trained by using x_{Tpm}(n), m = 1 ~ M_T, for the pth group. Here, M_T is the number of the training data. After the training is completed, the untrained signals x_{Upm}(n) are applied to the NN, and the output is calculated. For the input signal x_{Upm}(n), if the pth output y_p has the maximum value, then the signal is correctly classified. Otherwise, the network fails in classification.
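
The decision rule just described amounts to taking the index of the largest output. A minimal sketch follows, with hypothetical function names and groups indexed from 0 rather than 1:

```python
def classify(outputs):
    """Index of the output unit with the maximum value."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

def classification_rate(output_rows, true_group):
    """Percentage of signals of one group whose maximum output is `true_group`."""
    correct = sum(1 for outputs in output_rows if classify(outputs) == true_group)
    return 100.0 * correct / len(output_rows)

# Four made-up output vectors for signals that all belong to group 0.
rows = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.3], [0.7, 0.2]]
rate = classification_rate(rows, true_group=0)
print(rate)
```

The output vectors in the example are invented for illustration only and are not results from the paper.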

IV  SELECTION  OF  ACTIVATION 
FUNCTIONS 

Which kinds of activation functions should be selected is very important. At the same time, it is a very difficult problem. In this paper, the following typical functions are selected for the hidden layer.

When a binary target can be considered, the sigmoid function can be used in the output layer.

Sigmoid function:

y_j = f_{sig}(net_j) = \frac{1}{1 + e^{-net_j}}    (9)

Sinusoidal function:

y_j = f_{sin}(net_j) = \sin(\pi\, net_j)    (10)

Gaussian function:

y_j = f_{g}(net_j) = e^{-net_j^2}    (11)
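
For reference, the three hidden-unit functions can be written as short Python functions. The exact forms are assumptions where the printed equations are unclear: the conventional logistic function is used for the sigmoid and a unit-width Gaussian for the radial basis function; the function names are hypothetical.

```python
import math

def f_sig(net):
    """Sigmoid: conventional logistic form (assumed)."""
    return 1.0 / (1.0 + math.exp(-net))

def f_sin(net):
    """Sinusoidal function with the argument scaled by pi."""
    return math.sin(math.pi * net)

def f_gauss(net):
    """Gaussian: unit-width form (assumed), peak at net = 0."""
    return math.exp(-net * net)

for net in (-1.0, 0.0, 0.5, 1.0):
    print(net, round(f_sig(net), 3), round(f_sin(net), 3), round(f_gauss(net), 3))
```

The sigmoid splits the input axis into a low side and a high side, the sinusoid repeats its high and low regions periodically, and the Gaussian is high only in a single narrow region around zero.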

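The input vectors are distributed in an N-dimensional space. The three functions divide the space as follows:

f_{sig}(net_j) \begin{cases} > a_+, & net_j > T_{sig} \\ < a_-, & net_j < -T_{sig} \end{cases}    (12)

f_{sin}(net_j) \begin{cases} > a_+, & |net_j - (2n\pi + \frac{\pi}{2})| < T_{sin} \\ < a_-, & |net_j - (2n\pi + \frac{3}{2}\pi)| < T_{sin} \end{cases}    (13)

f_{g}(net_j) \begin{cases} > a_+, & |net_j| < T_{g} \\ < a_-, & |net_j| > T_{g} \end{cases}    (14)

Here, n is an integer.

These space divisions are fundamental and independent of each other. This is the idea behind selecting the above three functions.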

The next step in selecting activation functions is how to combine them. This is also highly dependent on the distribution of the input signals, and is very hard to determine beforehand. For this reason, both the homogeneous function and the composite functions are investigated.

V  SIMULATION  OF  TRAINING  AND 
CLASSIFICATION  WITHOUT  NOISE 

5.1  Multi-frequency  Signals 

The number of frequency components is R = 3, and the number of signal groups is P = 2. The frequency components are located alternately between the groups as follows: F_1 = [1, 2, 3] Hz for Group 1 (#1) and F_2 = [1.5, 2.5, 3.5] Hz for Group 2 (#2). The sampling frequency is 10 Hz, that is, T = 0.1 sec. The number of samples N is 10. Therefore, the observation interval is 1 sec.
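
A minimal sketch of how such signals could be generated is given below, using the values stated above (R = 3, T = 0.1 sec, N = 10) and uniformly distributed random amplitudes and phases. The function names, the seeds and the group size of 200 used in the example are illustrative assumptions, and `random.uniform` may return the end points of its range.

```python
import math
import random

def make_signal(freqs_hz, n_samples=10, period=0.1, rng=random):
    """One multi-frequency signal x(n), n = 1..N, as a sum of R sinusoids."""
    # One random amplitude and one random phase per frequency component.
    amps = [rng.uniform(0.0, 1.0) for _ in freqs_hz]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in freqs_hz]
    return [
        sum(a * math.sin(2.0 * math.pi * f * n * period + ph)
            for a, f, ph in zip(amps, freqs_hz, phases))
        for n in range(1, n_samples + 1)
    ]

def make_group(freqs_hz, count, seed=0):
    """`count` signals that share the same frequencies."""
    rng = random.Random(seed)
    return [make_signal(freqs_hz, rng=rng) for _ in range(count)]

F1 = [1.0, 2.0, 3.0]   # Group 1 frequencies in Hz
F2 = [1.5, 2.5, 3.5]   # Group 2 frequencies in Hz
group1 = make_group(F1, 200, seed=1)
group2 = make_group(F2, 200, seed=2)
print(len(group1), len(group1[0]))
```

Because only 10 samples spanning 1 sec are available and the two groups' frequencies interleave at 0.5 Hz spacing, the two groups are hard to separate by inspecting a spectrum of a single signal.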

5.2  Training  and  Classification 

The signals x_{Tpm}(n), m = 1 ~ 200, and x_{Upm}(n), m = 1 ~ 1800, are used. Simulation results are shown in Table 1. The training converged using three hidden units for all activation functions. In the case of the Gaussian and the sinusoidal functions, the training almost converged with one hidden unit. Detailed discussion will be provided in Sec. 7.


Table 1: Classification rates by three functions (%)

VI SIMULATION USING THREE
ACTIVATION FUNCTIONS

6.1  Additive  Noise 

White noise, denoted noise(n), is generated as random numbers, and is added to the signal x_{pm}(n). The noisy signal x'_{pm}(n) is given by

x'_{pm}(n) = x_{pm}(n) + noise(n)    (15)

6.2 Training and Classification

The noisy multi-frequency signals are used for training. N is 10 and M is 200 for each group. After training, untrained signals with white noise are applied, and classification rates are evaluated. The white noise is uniformly distributed in the range ±0.5. The results are shown in Table 2. Columns (A) and (B) list the recognition rates using the training signals without and with white noise, respectively. The NN trained without noise is also used for comparison. From these results, it can be confirmed that training using noisy signals is useful to achieve robustness.
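
Adding the noise described above can be sketched in a few lines. The helper name and the seed are hypothetical; the noise is drawn independently for each sample from a uniform distribution on ±0.5.

```python
import random

def add_uniform_noise(signal, half_width=0.5, rng=random):
    """Return a noisy copy of `signal`: each sample gets independent uniform noise."""
    return [x + rng.uniform(-half_width, half_width) for x in signal]

clean = [0.0] * 10                      # placeholder signal of N = 10 samples
noisy = add_uniform_noise(clean, rng=random.Random(0))
print(noisy)
```

Applying this helper to the training set before training, and to the untrained set before evaluation, would correspond to case (B); applying it only at evaluation time would correspond to case (A).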

Table 2: Classification rates (%) using training signals (A) without and (B) with white noise

[Table rows: sigmoid, sinusoidal and Gaussian activation functions, each with 1 and 3 hidden units; columns (A) and (B). The numerical entries are not recoverable.]

6.3  Convergence  Rates 

Figure 1 shows learning curves obtained using three hidden units. The NN with the Gaussian function converges faster than the others. However, the error does not decrease well. The NN with the sinusoidal function also converges fast. At the same time, the error decreases well. The convergence rate using the sigmoid function is slow. However, the error can reach the same level as with the sinusoidal function.


VII  Convergence  Property  Using  Single 
Hidden  Unit 

7.1 Pure Multi-frequency Signals

The NNs trained without noise are further investigated through the hidden unit input and output distributions. Figure 2 illustrates these distributions, using the sigmoid (a1), the sinusoidal (b1) and the Gaussian functions (c1).

In the case of the sigmoid function, the data #1 and the data #2 have to be located on the right or the left side. This is a fundamental space division property of the sigmoid function. Thus, the network has to adjust the weights so that the hidden unit input data are completely separated into the right or the left side. The data #2 are concentrated at the edge of the region, as shown in Eq. (12), but the data #1 are distributed widely. From this result, the distribution of the hidden unit inputs generated by the multi-frequency signals cannot satisfy the requirements given by Eq. (12).

In the case of the sinusoidal function, the hidden unit inputs of the data #2 are located near one of the peaks and the data #1 are distributed widely. The sinusoidal function has a large differential coefficient except at the peaks. The data #2 can therefore be shifted quickly toward one of the peaks. On the other hand, the data #1 can be located in the region of f_{sin}(net_j) < a_-. Therefore, the requirement of the fundamental division property given by Eq. (13) is satisfied by the multi-frequency signals.

In the case of the Gaussian function, the data #2 are located around the peak. The differential coefficients around the peak are large, so the data #2 can be shifted toward this area very fast. Most of the data #1 are distributed on both sides.

From these results, the hidden unit inputs of the multi-frequency signals can be concentrated in a narrow range for one group, and distributed widely for the other group.

Thus, the space division property of the Gaussian function best matches the distribution of the multi-frequency signals. This function can provide the best accuracy, as shown in Table 1.

Figure 2: Hidden unit input and output distributions

7.2  Noisy  Multi-frequency  Signals 

In Figure 2, (a2), (b2) and (c2) correspond to the hidden unit input and output distributions when random noise is added. The network is trained by using the pure multi-frequency signals. After the training, the untrained noisy signals are applied to the NN. The distribution of the hidden unit inputs is easily spread by adding the noise.

In the case of the sigmoid, the data #2 are distributed widely. However, most of the data #2 still remain in their own region, because the sigmoid has wide stable regions. This is the reason why it can provide better accuracy than the others.

In the case of the Gaussian, the data #2 are distributed over the other region, because the single peak is very narrow. These data then easily move into the other group's region. Thus, the accuracy is decreased by adding the noise.

In the sinusoidal case, the data #2 are also widely distributed. However, the sinusoidal function is a periodic function, having several narrow stable regions. Thus, it can provide higher accuracy than the Gaussian function.

VIII  Convergence  Property  Using  Several 
Hidden  Units 

8.1 Homogeneous Activation Functions

Figures 3, 5 and 7 show distributions of the hidden unit inputs and outputs. The NNs are trained by using the signals without noise. The sigmoid, the sinusoidal and the Gaussian functions are separately used. In each figure, (a), (b) and (c) each correspond to one of the hidden units; (a1), (b1) and (c1) are the responses for the data #1, and (a2), (b2) and (c2) are for the data #2.

From these figures, there are two types of distributions, that is, concentrated and dispersed distributions. One of the two groups is located near the peak of the function and the other is widely spread. The overlap of the distributions between the two groups causes misclassification.

In Fig. 3, it is very interesting that the data #2 are located at the middle of the slope. Since this region is not a stable region, it can be expected that accuracy is easily degraded by adding the noise. As shown in Table 2, this is true. The classification rates are 97.3% for the data #1 and 8.4% for the data #2. Accuracy for the data #2 is greatly reduced.

Figures 4, 6 and 8 show the distributions of the inputs of the two output units. In these figures, (a) and (b) correspond to the data #1 and the data #2, respectively. The region of overlap of the solid and the dotted lines will cause misclassification. From these figures, we can investigate how the hidden units separate the signals into two groups. When the data #2 are applied, there is no overlap, so the hidden unit input space is well separated. When the data #1 are applied, there is some overlap. These overlaps cause misclassification. These results are consistent with the accuracies shown in Table 1.

From the figures, the input space of the output units is well separated by the sigmoid and sinusoidal functions. So, it can be concluded that the three hidden units cooperate to make the distribution of the inputs to the output units linearly separable.

8.2  Composite  Activation  Functions 

Three functions can be combined in the same hidden layer. This combination is called 'Composite Activation Function' in this paper.
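
One way to picture a composite activation function is a hidden layer in which each unit is assigned its own function. The sketch below is illustrative only: the dictionary keys, the helper name, the logistic form of the sigmoid and the unit-width Gaussian are all assumptions, and the mapping of particular assignments to the combination labels used in the tables is not reproduced here.

```python
import math

# Assumed forms of the three hidden-unit functions.
ACTIVATIONS = {
    "sigmoid": lambda net: 1.0 / (1.0 + math.exp(-net)),
    "sinusoidal": lambda net: math.sin(math.pi * net),
    "gaussian": lambda net: math.exp(-net * net),
}

def composite_hidden_layer(nets, kinds):
    """Apply a possibly different activation function to each hidden unit input."""
    return [ACTIVATIONS[kind](net) for net, kind in zip(nets, kinds)]

# Three hidden units, one of each kind.
mixed = composite_hidden_layer([0.0, 0.5, 0.0],
                               ["sigmoid", "sinusoidal", "gaussian"])
# A homogeneous layer is the special case where all kinds are equal.
homogeneous = composite_hidden_layer([0.0, 0.0, 0.0], ["gaussian"] * 3)
print(mixed, homogeneous)
```

Treating the homogeneous layer as a special case of the composite one keeps the rest of the network code unchanged when different combinations are compared.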

Table 3 shows classification rates using the multi-frequency signals without noise. In this table, the symbols D through J correspond to the combinations of the functions.

The combination C, having three Gaussian functions, achieves the best accuracy. The convergence rate is also the fastest among the three functions. The combination D, having all activation functions, achieves better accuracy than the others except for C. However, I and J, which include two Gaussian functions, are worse than D.

K through M are compared with E through J. E and F are better than K. Thus, adding both the sinusoidal and the Gaussian to the sigmoid can improve the performance. H is better than L, but G is worse than L. Thus, adding the Gaussian to the sinusoidal can improve the performance, while adding the sigmoid cannot.

In most of the combinations, the Gaussian achieves better accuracy. Thus, the property of each function does not appear straightforwardly in the combinations.

Table 4 shows classification rates of the network trained using the noisy signals. Training itself did not converge in all cases. This means that the accuracy is not 100% for all combinations of the functions. The networks using the homogeneous activation functions A and B have higher accuracy than the others. However, C does not achieve better accuracy than the others. Thus, the homogeneous activation function cannot always achieve better accuracy than the composite activation functions.


K through M are compared with E through J. G and H are better than L. Thus, adding the sigmoid or the Gaussian to the sinusoidal works well. K is better than E and F. Thus, adding both the sinusoidal and the Gaussian to the sigmoid does not work well.

The sinusoidal and sigmoid functions achieve good accuracy in most of the combinations. However, the sinusoidal combination does not always achieve better accuracy. Thus, the property of each function does not appear straightforwardly in the combinations, as previously discussed for the case without additive noise.


Table 4: Classification rates using signals with noise

IX CONCLUSIONS

Properties of the activation functions for multi-frequency signal classification have been discussed using a multilayer neural network trained by the BP algorithm. The Gaussian function can provide the highest performance for the signals without noise. However, it is sensitive to the additive noise. The sigmoid function is not useful for a single hidden unit. If several hidden units are used, then the sigmoid function becomes useful, and is insensitive to the additive noise. The sinusoidal function is useful for noisy signals. [...] provides good one.

Figure 3: Distribution of sigmoid hidden unit outputs

Figure 4: Distribution of output unit inputs

Figure 5: Distribution of sinusoidal hidden unit outputs

Figure 6: Distribution of output unit inputs

Figure 7: Distribution of Gaussian hidden unit outputs

Figure 8: Distribution of output unit inputs